Detecting Neanderthal Introgression with Deep Learning

Nikolay Oskolkov, SciLifeLab, NBIS Long Term Support, nikolay.oskolkov@scilifelab.se

Abstract

In this notebook, I will demonstrate how we can use Deep Learning for calling regions of Neanderthal introgression in modern humans.

Table of Contents:

- Extract Regions of Neanderthal Introgression
- Extract Regions of Reduced Neanderthal Ancestry
- Clean Fasta-Files from N-Containing Entries
- Prepare Data for Convolutional Neural Network
- Run Shallow Convolutional Neural Network

Extract Regions of Neanderthal Introgression

Since the revolutionary draft of the Neanderthal genome (https://science.sciencemag.org/content/328/5979/710), several studies have been carried out to systematically establish maps of regions of Neanderthal introgression in modern humans. These regions were hypothesized to have multiple functional effects, related to skin color, male fertility, etc.

In [3]:
from IPython.display import Image
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
Image(Path + 'neanderthal_dna.png', width=2000)
Out[3]:

Since both Reich and Akey made their 1000G Neanderthal introgression calls against the hg19 version of the human reference genome, we downloaded hg19 from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/ and prepared it for fast sequence extraction with samtools faidx. Let us test it and print the sequence of the SLC16A11 gene, which was inherited from Neandertals and is known to have a strong association with Type 2 Diabetes (T2D).

In [5]:
%%bash
cd /home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/
samtools faidx hg19.fa.gz chr17:6944941-6947411 | sed '1d' | tr -d '\n'
TTCAGGAAAACCCGATAAAAATTCTTTATTGGGGGAGGGGCTCAAACAAGAAAATAATCAACAAGTGGTGTCCAGAGTGGAGCCAGGGCCTCCTGGGGACAGCAAGACTGCCTGGGGAGCGGGAAGCAGCTCCCCCGTCTCTGGGGGAGGCGTGGCTGGAGGGGAGGCTGGACCACAGGAGGGCAGCGCCCTGGGCAACCCTATGTAGATGAAGCTGCCGGAGAGGATCAAAGAACCAGACAGGAGGAAAGAGGCGGTGAAGTCTCCTGTCTCATCCCTTAGGAAGCCTGAGGAGATGGGTAAGGGCATTTAGAAGCCTCGAACCCCAGGGGCAGAGGATAGTTGTAGGCAGATCTGTGAGCTCAGGTCCTTACCTGACAGGGGAGGGCCCAGGAGCCCCCCGAGGCTCATCAGCATCATCACCAGCCCTGTGGCCTGCACCACACCTCCGACGCCCACCAGCCCGGGGAGTACACCGAAAACCAGCGGGGCGTAACTCCCCGCGCTCAGCCCATAGGCCACAGCCGCGGCCAGCAGGGGACCCCCCCAGCTCTCTTCGCCGCCCACCACGGGCACCAGCCCCACCACCCACAGCCCCAGCCCAGTCAGAGCCCCGAATACGGCCAGCAGCCGCGGGAGGGGCACCCAGCCTTGGTCTGCCAGCCACCCGCAGACCAGCCGGGCGCCCGCATCCCCCATCGCAGCCACGGCCACCACCAGCGCTGCTCCGTATCCCCCCAGGCCCCGGTCTAAAGCGTGGGGAGCCAAGTGCACGTAAGGAACGAAGTACCCGCCCCCAACCAGGGCTGTGCCTAGAGCAAAGATTGAGAAGGCCCGGCGTGTGAACAGACTCAGGCCGAGGGCAGCTAGGGGACTACGCGGTGGGGCTGGGGGGTCTCCAGGAAGGACCAGGGGTAGCAGCAGGGCGCCACAGGGGGTGAGGTGGAGGGTGATCGCGCCGAGGAGGAGCAGAGCGCCCCGCCAGCCGAAAGTATCGAGAAGAAGCTGCAAGGCGGGCGCCAGGAGCAGCGAGGAGGCCCCGTTGCCGGTGAGCGCCAGCCCCACCGCCAAGACTCGACGGCGGGAGAAGTAACGCGAGAGGGTGCCTAGGGCGGGGGCGAACACCAGGGCCCAACCAAAGCCTGCGAATGAATAGGAGGGGATGGGGGCCGGCACTGGGGACGCCCGCCCCAGCATTCCCAGCCCGGCTCTCCGCACCAGGCCCCCGCCTCGTTCGCTACCCCAGATCCCAACAAGCTCCTGTCACCTCCTTCACCCTGAATGACCCGGGCATCCCACTTCCCTCACCAGCGAGGAGGCCCAGGCCGAGGTAGAGATGCAGCAGATCGCTGGCGAAAGCCGAGAAGACGAAGCCCAGCGAGGCGAGGACGCCCCCAACCATCACCACGGGGCGGGCCCCCCAGCGCGTGCTCAGGGCGCTGCCCACGGGGCCTGAAAGGGGGCGGAGTCAACGGAAGACACGCCCCCGGGCCCCCAAACTCTCTCCAACACTTCTCATTGGCTGTTTGCCTCCCTCGAACTAATGGTTCTTCTCCTTGACCTCAAGATGCACTCTTTTCTAGAGCCGGTTGCCCTTTCTCAGGTTGTCGGGTAATCAGGTGGGTCTCCCATCCCTCCACTCACGGCTCCTCCTCCCCGTTCCTGTCTCCCGCCTCAGGGCCCCCCTCACTGGCTGCCTGCTGCACGGCCAGGGCCAGGGCGCTGATCCACGCAGTGTCCTGGGCGCTTCGGTCAAAGTGCTCGGCAAGGTCAGGGAAGGCAAGGCCCAGCGAGCGCAGCAGCCCGTAGGACAGCCCGTTTATCGCGAAGGCTGCGGCCGCCACCACCCAGCCCCAGCCCCCATCCGGGGGTCCGGCGGGCTGGGGGGTCATCGCCGTCTGCGGGGTGGGGAAACATCTGTGAGAGAAGCCTCCACGCCTGTGCTTCCGCTGGGGAGCTGGCATCCCTGAGATCCAGCCTCTGGCTGCTCGCCCGGGGTG
GGCCCGTCACTCCGAGCGCGATGGCGCGGGGCGGAGGGAGCCGGAGCCTGGCCTAGGGCTGGAACTCCCGGGTCCGCGCGAGGTACGGGGACGGGGACAGCCAGATCCCCAGGCCCGAGGTTCCCCGTCCCACGCACACTCCTTCCCCCTTACCCGACCTCTCCGGGCGGTGCGGGGAGGGGAAGGGTGAGGAAGGGCTGGGCCCGGCTTTCTCTCTGCTTCCCAGGCGGGCGGGGGCCCCGAAGGGGAGCGAGGGCAGCGATGGAGCCCAACTTGGACGGGCTCTCTCGGTAAACAGAGATCACCACAGGGGCCTGAGCCACCTGGGACGCGGGGTGTACGGAGGGCGGGGGCGAGCTGAAACAGCCAATCCGGCAAGCCGCGCGTGAGGCCAGACGCCAGTCCCACGGGGATTGAATTATGTACACACACAACCCCCAGGCCCGGGGGAGGGGGGAGCGGCGAGAAATG

We can also write a simple Python script that extracts sequences for given regions from the fasta-file. Again, we extract the SLC16A11 sequence and compare it with the sequence extracted with samtools:

In [3]:
import gzip

def getfasta(chr, start, end):
    Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
    start_reading = False
    chunks = []
    with gzip.open(Path + 'hg19.fa.gz', 'rt') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'):
                # stop once we have passed the requested chromosome
                if start_reading:
                    break
                start_reading = (line == '>chr' + str(chr))
            elif start_reading:
                chunks.append(line)
    print('\x1b[31m' + ''.join(chunks)[start-1:end] + '\x1b[0m')
getfasta(17, 6944941, 6947411)
TTCAGGAAAACCCGATAAAAATTCTTTATTGGGGGAGGGGCTCAAACAAGAAAATAATCAACAAGTGGTGTCCAGAGTGGAGCCAGGGCCTCCTGGGGACAGCAAGACTGCCTGGGGAGCGGGAAGCAGCTCCCCCGTCTCTGGGGGAGGCGTGGCTGGAGGGGAGGCTGGACCACAGGAGGGCAGCGCCCTGGGCAACCCTATGTAGATGAAGCTGCCGGAGAGGATCAAAGAACCAGACAGGAGGAAAGAGGCGGTGAAGTCTCCTGTCTCATCCCTTAGGAAGCCTGAGGAGATGGGTAAGGGCATTTAGAAGCCTCGAACCCCAGGGGCAGAGGATAGTTGTAGGCAGATCTGTGAGCTCAGGTCCTTACCTGACAGGGGAGGGCCCAGGAGCCCCCCGAGGCTCATCAGCATCATCACCAGCCCTGTGGCCTGCACCACACCTCCGACGCCCACCAGCCCGGGGAGTACACCGAAAACCAGCGGGGCGTAACTCCCCGCGCTCAGCCCATAGGCCACAGCCGCGGCCAGCAGGGGACCCCCCCAGCTCTCTTCGCCGCCCACCACGGGCACCAGCCCCACCACCCACAGCCCCAGCCCAGTCAGAGCCCCGAATACGGCCAGCAGCCGCGGGAGGGGCACCCAGCCTTGGTCTGCCAGCCACCCGCAGACCAGCCGGGCGCCCGCATCCCCCATCGCAGCCACGGCCACCACCAGCGCTGCTCCGTATCCCCCCAGGCCCCGGTCTAAAGCGTGGGGAGCCAAGTGCACGTAAGGAACGAAGTACCCGCCCCCAACCAGGGCTGTGCCTAGAGCAAAGATTGAGAAGGCCCGGCGTGTGAACAGACTCAGGCCGAGGGCAGCTAGGGGACTACGCGGTGGGGCTGGGGGGTCTCCAGGAAGGACCAGGGGTAGCAGCAGGGCGCCACAGGGGGTGAGGTGGAGGGTGATCGCGCCGAGGAGGAGCAGAGCGCCCCGCCAGCCGAAAGTATCGAGAAGAAGCTGCAAGGCGGGCGCCAGGAGCAGCGAGGAGGCCCCGTTGCCGGTGAGCGCCAGCCCCACCGCCAAGACTCGACGGCGGGAGAAGTAACGCGAGAGGGTGCCTAGGGCGGGGGCGAACACCAGGGCCCAACCAAAGCCTGCGAATGAATAGGAGGGGATGGGGGCCGGCACTGGGGACGCCCGCCCCAGCATTCCCAGCCCGGCTCTCCGCACCAGGCCCCCGCCTCGTTCGCTACCCCAGATCCCAACAAGCTCCTGTCACCTCCTTCACCCTGAATGACCCGGGCATCCCACTTCCCTCACCAGCGAGGAGGCCCAGGCCGAGGTAGAGATGCAGCAGATCGCTGGCGAAAGCCGAGAAGACGAAGCCCAGCGAGGCGAGGACGCCCCCAACCATCACCACGGGGCGGGCCCCCCAGCGCGTGCTCAGGGCGCTGCCCACGGGGCCTGAAAGGGGGCGGAGTCAACGGAAGACACGCCCCCGGGCCCCCAAACTCTCTCCAACACTTCTCATTGGCTGTTTGCCTCCCTCGAACTAATGGTTCTTCTCCTTGACCTCAAGATGCACTCTTTTCTAGAGCCGGTTGCCCTTTCTCAGGTTGTCGGGTAATCAGGTGGGTCTCCCATCCCTCCACTCACGGCTCCTCCTCCCCGTTCCTGTCTCCCGCCTCAGGGCCCCCCTCACTGGCTGCCTGCTGCACGGCCAGGGCCAGGGCGCTGATCCACGCAGTGTCCTGGGCGCTTCGGTCAAAGTGCTCGGCAAGGTCAGGGAAGGCAAGGCCCAGCGAGCGCAGCAGCCCGTAGGACAGCCCGTTTATCGCGAAGGCTGCGGCCGCCACCACCCAGCCCCAGCCCCCATCCGGGGGTCCGGCGGGCTGGGGGGTCATCGCCGTCTGCGGGGTGGGGAAACATCTGTGAGAGAAGCCTCCACGCCTGTGCTTCCGCTGGGGAGCTGGCATCCCTGAGATCCAGCCTCTGGCTGCTCGCCCGGGGTG
GGCCCGTCACTCCGAGCGCGATGGCGCGGGGCGGAGGGAGCCGGAGCCTGGCCTAGGGCTGGAACTCCCGGGTCCGCGCGAGGTACGGGGACGGGGACAGCCAGATCCCCAGGCCCGAGGTTCCCCGTCCCACGCACACTCCTTCCCCCTTACCCGACCTCTCCGGGCGGTGCGGGGAGGGGAAGGGTGAGGAAGGGCTGGGCCCGGCTTTCTCTCTGCTTCCCAGGCGGGCGGGGGCCCCGAAGGGGAGCGAGGGCAGCGATGGAGCCCAACTTGGACGGGCTCTCTCGGTAAACAGAGATCACCACAGGGGCCTGAGCCACCTGGGACGCGGGGTGTACGGAGGGCGGGGGCGAGCTGAAACAGCCAATCCGGCAAGCCGCGCGTGAGGCCAGACGCCAGTCCCACGGGGATTGAATTATGTACACACACAACCCCCAGGCCCGGGGGAGGGGGGAGCGGCGAGAAATG
In [101]:
import gzip

def getfasta(chr, start, end):
    Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
    start_reading = False
    chunks = []
    with gzip.open(Path + 'hg19.fa.gz', 'rt') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'):
                # stop once we have passed the requested chromosome
                if start_reading:
                    break
                start_reading = (line == '>chr' + str(chr))
            elif start_reading:
                chunks.append(line)
    print('\x1b[31m' + ''.join(chunks)[start-1:end] + '\x1b[0m')
getfasta(17, 6946902, 6946904)
CAT

The sequences look identical; still, samtools (optimized C code) is much, much faster than my Python implementation, so further on we will build the fasta-file of Neandertal-introgressed regions via samtools faidx. Now let us turn to the Neandertal-introgressed regions.

We downloaded the merged maps of Neandertal introgression for the European + Asian 1000G populations published in Vernot and Akey, Science, 2014 from https://akeylab.princeton.edu/downloads.html. Now let us read the coordinates of the introgressed regions, extract their sequences from the hg19 human genome, and write them to a new "introgression" fasta-file:

In [7]:
import pandas as pd
Path = '/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/'
intr_coords = pd.read_csv(Path + '1000G_Akey_NeandIntrogression/all_haplotypes_populations.bed.all', 
                          header = None, sep = "\t")
intr_coords.head()
Out[7]:
      0        1        2
0  chr1   834360   867552
1  chr1  1505425  1557644
2  chr1  1880284  1928481
3  chr1  1960388  2064958
4  chr1  2206041  2249092
In [8]:
intr_coords.shape
Out[8]:
(6721, 3)

Let us check the distribution of the lengths of the introgressed regions. This information will become important later, when we select the k parameter for the k-mer analysis, i.e. when we split the sequences ("biological text") into k-mers ("biological words"):

In [9]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
intr_lengths = intr_coords.iloc[:, 2]-intr_coords.iloc[:, 1]
sns.distplot(intr_lengths)
plt.title("Distribution of Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.xlabel("Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
In [10]:
from scipy import stats
print(stats.describe(intr_lengths))
DescribeResult(nobs=6721, minmax=(5251, 737082), mean=89296.93021871745, variance=4886040758.3506365, skewness=2.551246218418572, kurtosis=10.205693287318592)

We can see that the shortest stretch of Neanderthal introgression is around 5 kbp and the longest is close to 1 Mbp. However, the inbreeding occurred approximately 50,000 years ago, i.e. roughly 2,000 human generations ago, which according to the estimates of David Reich (https://www.nature.com/articles/nature12961) implies that the segments surviving from Neanderthals should be shorter than about 100 kbp. Therefore, the DNA regions in the right tail of the distribution are highly unlikely to truly come from Neanderthals and are most likely false positives, while the sequences close to the mode of the distribution should be prioritized for further analysis.
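The ~100 kbp scale can be motivated with a back-of-envelope calculation (my own sketch, not a computation from Reich's paper, and assuming a uniform recombination rate): after g generations, recombination breaks an introgressed haplotype into tracts of expected length roughly 1 / (g * r), where r is the per-bp per-generation recombination rate.

```python
g = 2000      # generations since admixture (~50,000 years / ~25 years per generation)
r = 1.25e-8   # commonly assumed human genome-wide average recombination rate per bp

# expected length of a surviving introgressed tract after g generations
expected_tract_bp = 1 / (g * r)
print(f'Expected surviving tract length: {expected_tract_bp / 1000:.0f} kbp')
# prints: Expected surviving tract length: 40 kbp
```

A few tens of kbp, which is indeed well below 100 kbp and close to the mode of the distribution above.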

In [11]:
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
a = 0
with open('hg19_introgr_regions.fa', 'a') as fp:
    for i in range(intr_coords.shape[0]):
        coord = str(str(intr_coords.iloc[i, 0]) + ':' 
                    + str(intr_coords.iloc[i, 1]) + '-' + str(intr_coords.iloc[i, 2]))
        subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
        a = a + 1
        if a%1000 == 0:
            print('Finished ' + str(a) + ' Neanderthal introgressed haplotypes')
Finished 1000 Neanderthal introgressed haplotypes
Finished 2000 Neanderthal introgressed haplotypes
Finished 3000 Neanderthal introgressed haplotypes
Finished 4000 Neanderthal introgressed haplotypes
Finished 5000 Neanderthal introgressed haplotypes
Finished 6000 Neanderthal introgressed haplotypes

It is also interesting to run a simple k-mer analysis with kat. We downloaded and installed kat from https://kat.readthedocs.io/en/latest/index.html and ran it with the following commands:

In [48]:
# kat hist hg19_introgr_regions.fa
# kat plot spectra-hist -x 500 kat.hist
'''
Kmer Analysis Toolkit (KAT) V2.4.2

Running KAT in HIST mode
------------------------

Input 1 is a sequence file.  Counting kmers for input 1 (hg19_introgr_regions.fa) ...
Warning: Specified hash size insufficent - attempting to double hash size... success!

Warning: Specified hash size insufficent - attempting to double hash size... success!

Warning: Specified hash size insufficent - attempting to double hash size... success!
 done.  Time taken: 517.0s

Bining kmers ... done.  Time taken: 51.8s

Merging counts ... done.  Time taken: 0.0s

Saving results to disk ... done.  Time taken: 0.0s

Creating plot ... done.  Time taken: 1.4s

Analysing peaks
---------------

Analysing distributions for: kat.hist ... done.  Time taken:  3.7s

K-mer frequency spectra statistics
----------------------------------
K-value used: 27
Peaks in analysis: 3
Global minima @ Frequency=77x (1054)
Global maxima @ Frequency=79x (1085)
Overall mean k-mer frequency: 163x

  Index    Left    Mean    Right    StdDev    Max    Volume  Description
-------  ------  ------  -------  --------  -----  --------  -------------
      1   14.41      79   143.59     32.3     306     24648  1X
      2   17.66     156   294.34     69.17    234     40154  2X
      3   30.04     395   759.96    182.48     31     10257  5X

Calculating genome statistics
-----------------------------
Assuming that homozygous peak is the largest in the spectra with frequency of: 79x
Homozygous peak index: 1
CAUTION: the following estimates are based on having a clean spectra and having identified 
the correct homozygous peak!
Estimated genome size: 0.14 Mbp

Creating plots
--------------

Plotting K-mer frequency distributions ... done.  Saved to: None


KAT HIST completed.
Total runtime: 574.5s
'''
In [263]:
from IPython.display import Image
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
Image(Path + 'kat_intr-spectra-hist.png', width=2000)
Out[263]:

Extract Regions of Reduced Neanderthal Ancestry

Now we need a set of control sequences, i.e. a set of regions of identical lengths extracted from parts of the genome depleted of Neanderthal ancestry.

In [57]:
intr_coords.head()
Out[57]:
      0        1        2
0  chr1   834360   867552
1  chr1  1505425  1557644
2  chr1  1880284  1928481
3  chr1  1960388  2064958
4  chr1  2206041  2249092
In [233]:
list(intr_lengths)[0:10]
Out[233]:
[33192, 52219, 48197, 104570, 43051, 83170, 58798, 121590, 115844, 161767]
In [139]:
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header=None, sep="\t")
chr_sizes.head()
Out[139]:
      0          1          2   3   4
0  chr1  249250621          6  50  51
1  chr2  243199373  254235646  50  51
2  chr3  198022430  502299013  50  51
3  chr4  191154276  704281898  50  51
4  chr5  180915260  899259266  50  51
In [241]:
import numpy as np

# draw a random region on the same chromosome and of the same length as the
# first introgressed region, redrawing until it avoids all introgressed regions
i = 0

chr_df = intr_coords[intr_coords[0] == intr_coords[0][i]]
chr_len = int(chr_sizes[chr_sizes[0] == intr_coords[0][i]].iloc[0, 1])
overlap = True
while overlap:
    # sample a start position so that the region fits on the chromosome
    reg_start = np.random.randint(1, chr_len - intr_lengths[i])
    reg_end = reg_start + intr_lengths[i]
    print(reg_start)
    overlap = False
    for j in range(chr_df.shape[0]):
        b1 = chr_df.iloc[j, 1]
        b2 = chr_df.iloc[j, 2]
        # overlap if either end falls inside, or the candidate spans, region j
        if (b1 < reg_start < b2) or (b1 < reg_end < b2) or (reg_start <= b1 and reg_end >= b2):
            print(reg_start, reg_end, b1, b2)
            overlap = True
            break
print(intr_coords[0][i], reg_start, reg_end)
230513872
230513872 230547064 230521064 230558720
121412471
chr1 121412471 121445663
In [256]:
import numpy as np

chr_list = []
start_list = []
end_list = []
for i in range(intr_coords.shape[0]):
    chr_df = intr_coords[intr_coords[0] == intr_coords[0][i]]
    chr_len = int(chr_sizes[chr_sizes[0] == intr_coords[0][i]].iloc[0, 1])
    overlap = True
    while overlap:
        # sample a start position so that the region fits on the chromosome
        reg_start = np.random.randint(1, chr_len - intr_lengths[i])
        reg_end = reg_start + intr_lengths[i]
        overlap = False
        # redraw if the candidate overlaps ANY introgressed region on this chromosome
        for j in range(chr_df.shape[0]):
            b1 = chr_df.iloc[j, 1]
            b2 = chr_df.iloc[j, 2]
            if (b1 < reg_start < b2) or (b1 < reg_end < b2) or (reg_start <= b1 and reg_end >= b2):
                overlap = True
                break
    chr_list.append(intr_coords[0][i])
    start_list.append(reg_start)
    end_list.append(reg_end)
depl_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
depl_coords.head()
Out[256]:
      0          1          2
0  chr1   27739522   27772714
1  chr1   61614008   61666227
2  chr1   16466982   16515179
3  chr1  145434652  145539222
4  chr1   45933725   45976776
In [257]:
depl_coords.shape
Out[257]:
(6721, 3)
In [259]:
intr_coords.head()
Out[259]:
      0        1        2
0  chr1   834360   867552
1  chr1  1505425  1557644
2  chr1  1880284  1928481
3  chr1  1960388  2064958
4  chr1  2206041  2249092
In [258]:
depl_lengths = depl_coords.iloc[:, 2] - depl_coords.iloc[:, 1]
In [260]:
list(intr_lengths)[0:10]
Out[260]:
[33192, 52219, 48197, 104570, 43051, 83170, 58798, 121590, 115844, 161767]
In [261]:
list(depl_lengths)[0:10]
Out[261]:
[33192, 52219, 48197, 104570, 43051, 83170, 58798, 121590, 115844, 161767]
In [262]:
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
with open('hg19_depl_regions.fa', 'a') as fp:
    for i in range(depl_coords.shape[0]):
        coord = str(str(depl_coords.iloc[i, 0]) + ':' 
                    + str(depl_coords.iloc[i, 1]) + '-' + str(depl_coords.iloc[i, 2]))
        subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
In [ ]:
# kat hist hg19_depl_regions.fa
'''
Kmer Analysis Toolkit (KAT) V2.4.2

Running KAT in HIST mode
------------------------

Input 1 is a sequence file.  Counting kmers for input 1 (hg19_depl_regions.fa) ...
Warning: Specified hash size insufficent - attempting to double hash size... success!

Warning: Specified hash size insufficent - attempting to double hash size... success!
 done.  Time taken: 303.3s

Bining kmers ... done.  Time taken: 61.1s

Merging counts ... done.  Time taken: 0.0s

Saving results to disk ... done.  Time taken: 0.0s

Creating plot ... done.  Time taken: 1.3s

Analysing peaks
---------------

Analysing distributions for: kat.hist ... done.  Time taken:  3.9s

K-mer frequency spectra statistics
----------------------------------
K-value used: 27
Peaks in analysis: 3
Global minima @ Frequency=68x (1358)
Global maxima @ Frequency=70x (1372)
Overall mean k-mer frequency: 146x

  Index    Left    Mean    Right    StdDev    Max    Volume  Description
-------  ------  ------  -------  --------  -----  --------  -------------
      1   13.66      70   126.34     28.17    406     28539  1X
      2   16.61     138   259.39     60.69    294     44290  2X
      3   28.38     350   671.62    160.81     38     12552  5X

Calculating genome statistics
-----------------------------
Assuming that homozygous peak is the largest in the spectra with frequency of: 70x
Homozygous peak index: 1
CAUTION: the following estimates are based on having a clean spectra and having identified the correct homozygous peak!
Estimated genome size: 0.15 Mbp

Creating plots
--------------

Plotting K-mer frequency distributions ... done.  Saved to: None


KAT HIST completed.
Total runtime: 370.1s
'''
In [264]:
from IPython.display import Image
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
Image(Path + 'kat_depl-spectra-hist.png', width=2000)
Out[264]:

We do not see any qualitative difference in the k-mer spectra between the Neanderthal-introgressed and depleted regions. Let us overlay the two spectra to check for more subtle differences:

In [280]:
import numpy as np
import matplotlib.pyplot as plt
intr_spectr = pd.read_csv('kat_intr.hist', header = None, comment = "#", sep = " ")
depl_spectr = pd.read_csv('kat_depl.hist', header = None, comment = "#", sep = " ")
plt.figure(figsize=(20,15))
plt.plot(intr_spectr[0], intr_spectr[1], c = 'blue')
plt.plot(depl_spectr[0], depl_spectr[1], c = 'red')
plt.xlim(70, 400)
plt.ylim(0, 1000)
plt.title('K-mer Spectra of Neanderthal Introgressed and Depleted Regions', fontsize = 20)
plt.xlabel('K-mer Multiplicity', fontsize = 20)
plt.ylabel('Frequency', fontsize = 20)
plt.show()

So there is no obvious difference between the Neanderthal-introgressed and depleted regions. However, one problem here is the high proportion of regions containing long stretches of N nucleotides, meaning that sequencing or genome assembly failed there (these are often repetitive regions). Here we count how many N-containing lines each fasta-file has.

In [282]:
!grep -c N hg19_introgr_regions.fa
397
In [283]:
!grep -c N hg19_depl_regions.fa
651867

Clean Fasta-Files from N-Containing Entries

Now we are going to clean the two fasta-files, i.e. Neanderthal-introgressed and depleted regions, of entries containing N nucleotides. We could have selected only N-free regions from the very beginning, but since the damage is already done, for simplicity we now remove an entry from both fasta-files if at least one of the pair contains N nucleotides. In other words, if e.g. the introgressed fasta-file contains a 5 kbp stretch where at least one nucleotide is N, we delete that whole stretch from the introgressed fasta-file together with the corresponding 5 kbp stretch in the depleted fasta-file.

In [314]:
from Bio import SeqIO
i = 0
for record in SeqIO.parse('hg19_introgr_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        i = i + 1
print(i)

i = 0
for record in SeqIO.parse('hg19_depl_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        i = i + 1
print(i)
11
505

It looks like the introgressed fasta-file has 11 entries containing N nucleotides while the depleted fasta-file has 505 such entries. If those entries do not overlap, we will remove 516 entries from both fasta-files, converting the nucleotides to upper case along the way, and write the clean records to new files. We are then going to convert the remaining ~6721 − 516 = 6205 entries into k-mers, which means we will have plenty of training data for Convolutional and Recurrent Neural Networks (CNN / RNN).
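The k-mer conversion itself is not shown in this notebook; a minimal sketch of what we mean by splitting a sequence ("biological text") into overlapping k-mers ("biological words") could look like this:

```python
def seq_to_kmers(seq, k=27):
    """Split a DNA sequence into its overlapping k-mers (len(seq) - k + 1 of them)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq_to_kmers('ACGTACGT', k=4)
# ['ACGT', 'CGTA', 'GTAC', 'TACG', 'ACGT']
```

Applied to a 5 kbp region, this yields thousands of "words" per entry, which is what makes the training set large.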

In [317]:
from Bio import SeqIO
for record in SeqIO.parse('hg19_introgr_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        print(record.id)
chr10:64300331-64407831
chr2:31608174-31709440
chr2:31803557-31906844
chr2:33021841-33109648
chr2:33120746-33165152
chr21:33082494-33158443
chr21:43243448-43329377
chr22:19004802-19283682
chr3:60165205-60298045
chr4:8790165-8819778
chr4:40187108-40307705
In [326]:
from Bio import SeqIO

intr_file = 'hg19_introgr_regions.fa'
depl_file = 'hg19_depl_regions.fa'
a = 0
i = 0
with open('hg19_introgr_clean.fa', 'a') as intr_out, open('hg19_depl_clean.fa', 'a') as depl_out:
    for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
        upper_intr = intr.seq.upper()
        upper_depl = depl.seq.upper()
        a = a + 1
        if a%1000 == 0:
            print('Finished ' + str(a) + ' entries')
        if 'N' not in str(upper_intr) and 'N' not in str(upper_depl):
            intr.seq = upper_intr
            SeqIO.write(intr, intr_out, 'fasta')
            depl.seq = upper_depl
            SeqIO.write(depl, depl_out, 'fasta')
            i = i + 1
        else:
            continue
print('We have processed ' + str(a) + ' entries and written ' + str(i) + ' entries to two fasta-files')
Finished 1000 entries
Finished 2000 entries
Finished 3000 entries
Finished 4000 entries
Finished 5000 entries
Finished 6000 entries
We have processed 6721 entries and written 6207 entries to two fasta-files

Let us make sure that there are no N-nucleotides present in the clean fasta-files:

In [327]:
!grep -c N hg19_introgr_clean.fa
0
In [328]:
!grep -c N hg19_depl_clean.fa
0

Prepare Data for Convolutional Neural Network

Now it is time to build the Neanderthal-introgressed vs. depleted sequence sets for feeding into the Convolutional Neural Network (CNN). For simplicity we make all sequences the same length by truncation. The natural common length would be 5251 bp, the minimal length of an introgressed stretch (kept as commented lines below); here, to speed up this demonstration, we keep only the first 32 bp of each sequence. We also skip a pair if either truncated sequence lacks one of the four bases, which keeps the per-sequence label encoding below consistent.

In [56]:
import os
from Bio import SeqIO

os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')

intr_file = 'hg19_introgr_clean.fa'
depl_file = 'hg19_depl_clean.fa'

a = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    #intr_seqs.append(str(intr.seq)[0:np.min(depl_lengths)])
    #depl_seqs.append(str(depl.seq)[0:np.min(depl_lengths)])
    s_intr = str(intr.seq)[0:32]
    s_depl = str(depl.seq)[0:32]
    if s_intr.count('A')>0 and s_intr.count('C')>0 and s_intr.count('G')>0 and s_intr.count('T')>0 and \
    s_depl.count('A')>0 and s_depl.count('C')>0 and s_depl.count('G')>0 and s_depl.count('T')>0:
        intr_seqs.append(s_intr)
        depl_seqs.append(s_depl)
    a = a + 1
    if a%1000 == 0:
        print('Finished ' + str(a) + ' entries')
Finished 1000 entries
Finished 2000 entries
Finished 3000 entries
Finished 4000 entries
Finished 5000 entries
Finished 6000 entries
In [57]:
sequences = intr_seqs + depl_seqs
len(sequences)
Out[57]:
11998
In [58]:
import numpy as np
labels = list(np.ones(len(intr_seqs))) + list(np.zeros(len(depl_seqs)))
len(labels)
Out[58]:
11998

The next step is to organize the data into a format that can be passed into a deep learning algorithm. Most deep learning algorithms accept data in the form of vectors or matrices (or more generally, tensors). To get each DNA sequence in the form of a matrix, we use one-hot encoding, which encodes every base in a sequence in the form of a 4-dimensional vector, with a separate dimension for each base. We place a "1" in the dimension corresponding to the base found in the DNA sequence, and "0"s in all other slots. We then concatenate these 4-dimensional vectors together along the bases in the sequence to form a matrix.
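As a minimal illustration of this encoding, independent of scikit-learn (note that the cell below re-fits the LabelEncoder on every sequence, which only yields a consistent A/C/G/T ordering because each sequence was filtered above to contain all four bases; a fixed-alphabet version avoids that caveat):

```python
import numpy as np

BASES = 'ACGT'

def one_hot(seq):
    """Encode a DNA sequence as a (len(seq), 4) matrix with a fixed A,C,G,T column order."""
    m = np.zeros((len(seq), len(BASES)))
    for i, base in enumerate(seq):
        m[i, BASES.index(base)] = 1
    return m

one_hot('ACGT')
# array([[1., 0., 0., 0.],
#        [0., 1., 0., 0.],
#        [0., 0., 1., 0.],
#        [0., 0., 0., 1.]])
```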

In [59]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

import warnings
warnings.filterwarnings('ignore')

# The LabelEncoder encodes a sequence of bases as a sequence of integers: 0, 1, 2 and 3
integer_encoder = LabelEncoder()  
# The OneHotEncoder converts the array of integers to a sparse matrix where
# each row one-hot encodes one base, i.e. only 0s and 1s are present in the matrix
one_hot_encoder = OneHotEncoder()   
input_features = []

for sequence in sequences:
  integer_encoded = integer_encoder.fit_transform(list(sequence))
  integer_encoded = np.array(integer_encoded).reshape(-1, 1)
  one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
  input_features.append(one_hot_encoded.toarray())

np.set_printoptions(threshold = 40)
#print(input_features.shape)
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n',sequences[0][:10],'...',sequences[0][-10:])
print('One hot encoding of Sequence #1:\n',input_features[0].T)
Example sequence
-----------------------
DNA Sequence #1:
 CTCATTTCAG ... TCCTGGGAGT
One hot encoding of Sequence #1:
 [[0. 0. 0. ... 1. 0. 0.]
 [1. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 1. 0.]
 [0. 1. 0. ... 0. 0. 1.]]
In [60]:
one_hot_encoder = OneHotEncoder()
labels = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()

print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)
Labels:
 [[1. 1. 1. ... 0. 0. 0.]]
One-hot encoded labels:
 [[0. 0. 0. ... 1. 1. 1.]
 [1. 1. 1. ... 0. 0. 0.]]

Now we are going to split the data set into train and test data sets:

In [61]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(
    input_features, input_labels, test_size = 0.2, random_state = 42)
In [62]:
train_features.shape
Out[62]:
(9598, 32, 4)
In [63]:
train_labels.shape
Out[63]:
(9598, 2)
In [64]:
test_features.shape
Out[64]:
(2400, 32, 4)
In [65]:
test_labels.shape
Out[65]:
(2400, 2)

Now everything is ready for classification with a Convolutional Neural Network (CNN). Here we choose a simple 1D CNN, an architecture commonly used in deep learning for functional genomics applications.

Run Shallow Convolutional Neural Network

A CNN learns to recognize patterns that are invariant across space by matching the input sequence against a number of learnable "filters" of a fixed size. In our dataset, the filters correspond to motifs within the DNA sequences. The CNN may then learn to combine these filters to recognize a larger structure (e.g. the presence or absence of a Neanderthal introgression signature in a sequence). We start by defining the CNN model and summarizing its fitting parameters.

In [83]:
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
from keras.models import Sequential
from keras.regularizers import l1

import warnings
warnings.filterwarnings('ignore')

model = Sequential()

model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform', 
                 input_shape = (train_features.shape[1], 4)))
model.add(Activation("relu"))
model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size = 2))
model.add(Dropout(0.3))

model.add(Flatten())
model.add(Dense(8, kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Dropout(0.5))
model.add(Dense(2, activation = 'softmax'))

epochs = 100
lrate = 0.01
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d_15 (Conv1D)           (None, 32, 16)            336       
_________________________________________________________________
activation_22 (Activation)   (None, 32, 16)            0         
_________________________________________________________________
conv1d_16 (Conv1D)           (None, 32, 16)            1296      
_________________________________________________________________
activation_23 (Activation)   (None, 32, 16)            0         
_________________________________________________________________
max_pooling1d_8 (MaxPooling1 (None, 16, 16)            0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 16, 16)            0         
_________________________________________________________________
flatten_8 (Flatten)          (None, 256)               0         
_________________________________________________________________
dense_15 (Dense)             (None, 8)                 2056      
_________________________________________________________________
activation_24 (Activation)   (None, 8)                 0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 8)                 0         
_________________________________________________________________
dense_16 (Dense)             (None, 2)                 18        
=================================================================
Total params: 3,706
Trainable params: 3,706
Non-trainable params: 0
_________________________________________________________________

Now we are ready to train the neural network. We will further divide the training set into a training and a validation set: we train only on the reduced training set, but plot the loss curve on both the training and validation sets. Once the loss on the validation set stops improving, or starts getting worse, it is time to stop training: the model has converged and may just be overfitting.
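We will stop by eye here, but the stopping rule can be automated (in Keras one would typically pass an EarlyStopping callback to model.fit; below is a hypothetical plain-Python sketch of the same patience logic):

```python
def should_stop(val_losses, patience=5):
    """Stop when the best (lowest) validation loss is more than `patience` epochs old."""
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    return (len(val_losses) - 1 - best_epoch) >= patience

# validation loss still improved recently -> keep training
should_stop([0.70, 0.69, 0.68], patience=5)                               # False
# no improvement for 5 epochs after epoch 1 -> stop
should_stop([0.70, 0.69, 0.693, 0.693, 0.694, 0.693, 0.695], patience=5)  # True
```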

In [84]:
import warnings
warnings.filterwarnings('ignore')

history = model.fit(train_features, train_labels, 
                    epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True)
Train on 7678 samples, validate on 1920 samples
Epoch 1/100
7678/7678 [==============================] - 1s 152us/step - loss: 0.6998 - binary_accuracy: 0.5035 - val_loss: 0.6932 - val_binary_accuracy: 0.4995
Epoch 2/100
7678/7678 [==============================] - 1s 88us/step - loss: 0.6936 - binary_accuracy: 0.4964 - val_loss: 0.6933 - val_binary_accuracy: 0.4995
...
Epoch 100/100
7678/7678 [==============================] - 1s 95us/step - loss: 0.6512 - binary_accuracy: 0.5970 - val_loss: 0.6892 - val_binary_accuracy: 0.5526
In [85]:
import matplotlib.pyplot as plt

plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()

The best way to evaluate whether the network has learned to classify sequences is to evaluate its performance on a fresh test set consisting of data that it has not observed at all during training. Here, we evaluate the model on the test set and plot the results as a confusion matrix.

In [100]:
from sklearn.metrics import confusion_matrix
import itertools

plt.figure(figsize=(15,10))

predicted_labels = model.predict(np.stack(test_features))
cm = confusion_matrix(np.argmax(test_labels, axis=1), 
                      np.argmax(predicted_labels, axis=1))
print('Confusion matrix:\n',cm)

cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]

plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
#plt.xticks([0, 1]); plt.yticks([0, 1])
#plt.grid('off')
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], '.2f'),
             horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
             color='white' if cm[i, j] > 0.5 else 'black')
plt.show()
Confusion matrix:
 [[395 769]
 [351 885]]
In [87]:
scores = model.evaluate(test_features, test_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Accuracy: 53.33%
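As a sanity check, this accuracy agrees with the confusion matrix printed above: the diagonal counts divided by the total number of test sequences give the same number.

```python
import numpy as np

# Confusion matrix reported above: rows = true labels, columns = predictions
cm = np.array([[395, 769],
               [351, 885]])

accuracy = np.trace(cm) / cm.sum()  # (395 + 885) / 2400
print(round(accuracy * 100, 2))     # 53.33
```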

The accuracy is not fantastic; apparently there is not enough signal in the data, i.e. the Neanderthal-introgressed regions look very similar, at the sequence level, to the regions with reduced Neanderthal ancestry. This can have many causes, both biological and technical. One technical artifact could be inaccurate annotation of the Neanderthal-introgressed regions by Vernot and Akey; we could use David Reich's annotation as an alternative.

Let us now take a step back and simplify the problem, addressing something that should be solvable within the current framework. For example, we can try to classify sequences as belonging either to genic or intergenic regions.

Gene vs. Non-Gene Sequence Classification

Here we will follow roughly the same approach as for the Neanderthal introgressed vs. depleted sequence classification and perform a simple "sentiment analysis", i.e. a gene vs. non-gene sequence classification. This should be an easier task, as genes contain start and stop codons that a CNN should recognize and learn quite easily. We will start by building an annotation file for protein-coding genes. We downloaded the RefSeq annotation file from http://genome.ucsc.edu/cgi-bin/hgTables as a text file and used the "genePredToGtf" tool to build the refGene_hg19.gtf gtf-annotation file.
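To illustrate the kind of signal involved, here is a toy scan for an open reading frame (an ATG start codon followed in-frame by a stop codon); this is for intuition only and not part of the classifier:

```python
def find_orf(seq):
    """Return (start, end) of the first in-frame ATG...stop stretch, or None."""
    stops = {'TAA', 'TAG', 'TGA'}
    start = seq.find('ATG')
    if start < 0:
        return None
    # Walk the sequence codon by codon, in frame with the start codon
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return (start, i + 3)
    return None

print(find_orf('CCATGAAATTTGGGTAGCC'))  # (2, 17)
```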

In [107]:
!cut -f9 refGene_hg19.gtf | cut -d ';' -f1 | cut -d ' ' -f2 | uniq | tr -d '\"' > refseq_gene_list_hg19.txt
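The same extraction can be done in Python; here is a sketch for a single hypothetical GTF line (the real file is processed with the shell one-liner above):

```python
def gene_name(gtf_line):
    """Extract the value of the first attribute (gene_id) from a GTF line."""
    attributes = gtf_line.rstrip('\n').split('\t')[8]  # column 9: attributes
    first = attributes.split(';')[0]                   # e.g. 'gene_id "SLC16A11"'
    return first.split(' ')[1].strip('"')

line = 'chr17\trefGene\ttranscript\t6944941\t6947411\t.\t-\t.\tgene_id "SLC16A11"; transcript_id "NM_153357";'
print(gene_name(line))  # SLC16A11
```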
In [104]:
%load_ext rpy2.ipython
In [108]:
%%R
library("biomaRt")
ensembl <- useMart("ensembl",host="grch37.ensembl.org")
ensembl <- useDataset("hsapiens_gene_ensembl",mart=ensembl)
refseq_genes <- scan("refseq_gene_list_hg19.txt", what = "character")
output <- getBM(attributes=c('hgnc_symbol', 'chromosome_name', 'start_position', 'end_position', 'strand'), 
                filters = c('hgnc_symbol'),values = refseq_genes,mart = ensembl)
head(output)
W1027 11:40:41.391831 139658277726016 callbacks.py:119] R[write to console]: Read 49762 items

W1027 11:40:41.982521 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:40:41.983824 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [>------------------------------]   2% eta: 25s
...
W1027 11:41:01.337388 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:01.338583 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [===================>-----------]  63% eta: 12s
W1027 11:41:01.552093 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:01.553249 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [===================>-----------]  64% eta: 11s
W1027 11:41:01.772850 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:01.775531 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [===================>-----------]  65% eta: 11s
W1027 11:41:01.998129 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:01.999359 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [===================>-----------]  66% eta: 11s
W1027 11:41:02.382618 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:02.384387 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [====================>----------]  67% eta: 10s
W1027 11:41:02.621140 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:02.622253 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [====================>----------]  68% eta: 10s
W1027 11:41:02.841887 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:02.843068 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [====================>----------]  69% eta: 10s
W1027 11:41:03.061557 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:03.062543 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=====================>---------]  70% eta:  9s
W1027 11:41:03.289378 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:03.290740 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=====================>---------]  71% eta:  9s
W1027 11:41:03.512754 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:03.513933 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=====================>---------]  72% eta:  9s
W1027 11:41:03.730867 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:03.732159 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [======================>--------]  73% eta:  8s
W1027 11:41:03.956536 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:03.957709 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [======================>--------]  74% eta:  8s
W1027 11:41:04.355638 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:04.356685 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [======================>--------]  75% eta:  8s
W1027 11:41:04.582739 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:04.583774 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=======================>-------]  76% eta:  7s
W1027 11:41:04.798639 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:04.799671 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=======================>-------]  77% eta:  7s
W1027 11:41:05.022374 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:05.023411 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=======================>-------]  78% eta:  7s
W1027 11:41:05.274255 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:05.275456 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=======================>-------]  79% eta:  6s
W1027 11:41:05.547676 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:05.548720 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [========================>------]  80% eta:  6s
W1027 11:41:06.583195 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:06.584232 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [========================>------]  81% eta:  6s
W1027 11:41:06.793331 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:06.794525 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [========================>------]  82% eta:  6s
W1027 11:41:07.683015 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:07.684079 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=========================>-----]  83% eta:  5s
W1027 11:41:07.902681 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:07.903951 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=========================>-----]  84% eta:  5s
W1027 11:41:08.759317 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:08.760653 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=========================>-----]  85% eta:  5s
W1027 11:41:08.990818 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:08.991901 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [==========================>----]  86% eta:  4s
W1027 11:41:09.279020 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:09.280170 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [==========================>----]  87% eta:  4s
W1027 11:41:09.473287 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:09.474262 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [==========================>----]  88% eta:  4s
W1027 11:41:09.692486 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:09.693601 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [===========================>---]  89% eta:  3s
W1027 11:41:09.894894 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:09.895874 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [===========================>---]  90% eta:  3s
W1027 11:41:10.098249 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:10.099393 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [===========================>---]  91% eta:  3s
W1027 11:41:10.320140 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:10.321382 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [============================>--]  92% eta:  3s
W1027 11:41:10.535534 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:10.536564 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [============================>--]  93% eta:  2s
W1027 11:41:10.746902 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:10.747936 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [============================>--]  94% eta:  2s
W1027 11:41:10.962227 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:10.963312 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [============================>--]  95% eta:  2s
W1027 11:41:11.179636 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:11.180706 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=============================>-]  96% eta:  1s
W1027 11:41:11.391198 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:11.392233 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=============================>-]  97% eta:  1s
W1027 11:41:11.598742 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:11.599742 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [=============================>-]  98% eta:  1s
W1027 11:41:11.769702 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:11.770810 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [==============================>]  99% eta:  0s
W1027 11:41:11.941489 139658277726016 callbacks.py:119] R[write to console]: 
W1027 11:41:11.942625 139658277726016 callbacks.py:119] R[write to console]: Batch submitting query [===============================] 100% eta:  0s
                                                                      sole]: 
W1027 11:41:11.945661 139658277726016 callbacks.py:119] R[write to console]: 
  hgnc_symbol   chromosome_name start_position end_position strand
1     ABCA11P                 4         419224       467918     -1
2     ABCA11P HG174_HG254_PATCH         425435       474129     -1
3       ACYP2                 2       54197975     54532437      1
4        ADAR                 1      154554538    154600475     -1
5       ADCY1                 7       45613739     45762715      1
6        ADD3                10      111756126    111895323      1
In [110]:
%%R
output_clean <- output[as.character(output$chromosome_name)%in%c(as.character(seq(1:22)),"X","Y"),]
dim(output_clean)
[1] 40965     5
In [111]:
%%R
head(output_clean)
  hgnc_symbol chromosome_name start_position end_position strand
1     ABCA11P               4         419224       467918     -1
3       ACYP2               2       54197975     54532437      1
4        ADAR               1      154554538    154600475     -1
5       ADCY1               7       45613739     45762715      1
6        ADD3              10      111756126    111895323      1
7      ADIPOQ               3      186560463    186576252      1
In [113]:
%%R
output_to_write <- output_clean
output_to_write$hgnc_symbol <- NULL
output_to_write$strand <- NULL
output_to_write$chromosome_name <- paste0("chr",output_to_write$chromosome_name)
write.table(output_to_write, file="gene_coords.txt",col.names=FALSE,row.names=FALSE,quote=FALSE,sep="\t")
In [118]:
import pandas as pd
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/'
gene_coords = pd.read_csv(Path + 'gene_coords.txt', 
                          header=None, sep="\t")
gene_coords.head()
Out[118]:
0 1 2
0 chr4 419224 467918
1 chr2 54197975 54532437
2 chr1 154554538 154600475
3 chr7 45613739 45762715
4 chr10 111756126 111895323
In [119]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
gene_lengths = gene_coords.iloc[:, 2]-gene_coords.iloc[:, 1]
sns.distplot(gene_lengths)
plt.title("Distribution of Gene Lengths", fontsize = 20)
plt.xlabel("Lengths of Genes", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
In [120]:
from scipy import stats
print(stats.describe(gene_lengths))
DescribeResult(nobs=40965, minmax=(40, 2304637), mean=76316.16152813377, variance=23479516110.360516, skewness=6.084410282367497, kurtosis=55.65833251525486)
In [121]:
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
with open('hg19_gene_regions.fa', 'a') as fp:
    for i in range(gene_coords.shape[0]):
        coord = str(str(gene_coords.iloc[i, 0]) + ':' 
                    + str(gene_coords.iloc[i, 1]) + '-' + str(gene_coords.iloc[i, 2]))
        subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
In [122]:
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header = None, sep = "\t")
chr_sizes.head()
Out[122]:
0 1 2 3 4
0 chr1 249250621 6 50 51
1 chr2 243199373 254235646 50 51
2 chr3 198022430 502299013 50 51
3 chr4 191154276 704281898 50 51
4 chr5 180915260 899259266 50 51
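The five columns of the `.fai` index produced by `samtools faidx` are the sequence name, its length, the byte offset of the first base, the number of bases per line, and the number of bytes per line. A minimal sketch of reading such an index with named columns (toy rows here, not the real hg19 index):

```python
import io
import pandas as pd

# two hedged toy rows in samtools .fai format: name, length, offset, linebases, linewidth
fai_text = "chr1\t249250621\t6\t50\t51\nchr2\t243199373\t254235646\t50\t51\n"
fai = pd.read_csv(io.StringIO(fai_text), sep='\t', header=None,
                  names=['name', 'length', 'offset', 'linebases', 'linewidth'])

# total number of bases indexed across the two toy chromosomes
print(fai['length'].sum())  # 492449994
```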
In [124]:
import numpy as np

chr_list = []
start_list = []
end_list = []
for i in range(gene_coords.shape[0]):
    # all genes located on the same chromosome as gene i
    chr_df = gene_coords[gene_coords[0] == gene_coords[0][i]]
    chrom_length = int(chr_sizes[chr_sizes[0] == gene_coords[0][i]].iloc[0, 1])
    overlap = True
    while overlap == True:
        # draw a random start so that the sampled region fits on the chromosome
        reg_start = np.random.randint(1, chrom_length - gene_lengths[i])
        reg_end = reg_start + gene_lengths[i]
        # re-sample until the region overlaps no gene on this chromosome
        for j in range(chr_df.shape[0]):
            b1 = chr_df.iloc[j, 1]
            b2 = chr_df.iloc[j, 2]
            if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2):
                overlap = True
                break
            else:
                overlap = False
    chr_list.append(gene_coords[0][i])
    start_list.append(reg_start)
    end_list.append(reg_end)
notgene_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
notgene_coords.to_csv("notgene_coords.txt", index = False, header = False, sep = "\t")
notgene_coords.head()
Out[124]:
0 1 2
0 chr4 71441514 71490208
1 chr2 218551269 218885731
2 chr1 17234237 17280174
3 chr7 74497155 74646131
4 chr10 24570946 24710143
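Because the endpoint tests above are open intervals, it is worth verifying independently that no sampled background region intersects a gene; the symmetric test `start <= b2 and end >= b1` also catches the case where a sampled region fully contains a gene. A self-contained sketch with toy intervals (the names `genes`, `background` and `overlaps` are illustrative, not from the notebook):

```python
import pandas as pd

# toy intervals standing in for gene_coords / notgene_coords: (chrom, start, end)
genes = pd.DataFrame([['chr1', 100, 200], ['chr1', 500, 900], ['chr2', 50, 80]])
background = pd.DataFrame([['chr1', 250, 350], ['chr1', 950, 1350], ['chr2', 100, 130]])

def overlaps(chrom, start, end, intervals):
    """True if [start, end] intersects any interval on the same chromosome."""
    same_chrom = intervals[intervals[0] == chrom]
    # two closed intervals intersect iff start <= b2 and end >= b1
    return bool(((same_chrom[1] <= end) & (same_chrom[2] >= start)).any())

clashes = sum(overlaps(c, s, e, genes) for c, s, e in background.itertuples(index=False))
print(clashes)  # 0 -> no background region touches a gene
```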
In [127]:
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
with open('hg19_notgene_regions.fa', 'a') as fp:
    for i in range(notgene_coords.shape[0]):
        coord = str(str(notgene_coords.iloc[i, 0]) + ':' 
                    + str(notgene_coords.iloc[i, 1]) + '-' + str(notgene_coords.iloc[i, 2]))
        subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
In [128]:
!grep -c N hg19_gene_regions.fa
67562
In [129]:
!grep -c N hg19_notgene_regions.fa
3194473
In [130]:
from Bio import SeqIO
i = 0
for record in SeqIO.parse('hg19_gene_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        i = i + 1
print(i)

i = 0
for record in SeqIO.parse('hg19_notgene_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        i = i + 1
print(i)
98
3019
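The `grep -c N` figures above are line counts, not sequence counts, and they miss soft-masked lowercase `n`; hence the much smaller numbers from the Biopython pass over upper-cased records. A toy illustration of the difference (self-contained toy FASTA text, not the real files):

```python
# toy FASTA: one record with hard-masked N's, one with soft-masked lowercase n's
fasta = """>region1
ACGTNNACGT
ACGTACGTAC
>region2
acgtnnacgt
acgtacgtac
"""
lines = fasta.strip().split('\n')

# grep -c N analogue: lines containing an uppercase 'N'
grep_count = sum('N' in line for line in lines)

# per-record analogue: records whose upper-cased sequence contains 'N'
records = {}
name = None
for line in lines:
    if line.startswith('>'):
        name = line[1:]
        records[name] = ''
    else:
        records[name] += line
record_count = sum('N' in seq.upper() for seq in records.values())

print(grep_count, record_count)  # 1 2
```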
In [132]:
from Bio import SeqIO

gene_file = 'hg19_gene_regions.fa'
notgene_file = 'hg19_notgene_regions.fa'
a = 0
i = 0
with open('hg19_gene_clean.fa', 'a') as gene_out, open('hg19_notgene_clean.fa', 'a') as notgene_out:
    for gene, notgene in zip(SeqIO.parse(gene_file, 'fasta'), SeqIO.parse(notgene_file, 'fasta')):
        upper_gene = gene.seq.upper()
        upper_notgene = notgene.seq.upper()
        a = a + 1
        if a%1000 == 0:
            print('Finished ' + str(a) + ' entries')
        if 'N' not in str(upper_gene) and 'N' not in str(upper_notgene):
            gene.seq = upper_gene
            SeqIO.write(gene, gene_out, 'fasta')
            notgene.seq = upper_notgene
            SeqIO.write(notgene, notgene_out, 'fasta')
            i = i + 1
        else:
            continue
print('We have processed ' + str(a) + ' entries and written ' + str(i) + ' entries to two fasta-files')
Finished 1000 entries
Finished 2000 entries
Finished 3000 entries
Finished 4000 entries
Finished 5000 entries
Finished 6000 entries
Finished 7000 entries
Finished 8000 entries
Finished 9000 entries
Finished 10000 entries
Finished 11000 entries
Finished 12000 entries
Finished 13000 entries
Finished 14000 entries
Finished 15000 entries
Finished 16000 entries
Finished 17000 entries
Finished 18000 entries
Finished 19000 entries
Finished 20000 entries
Finished 21000 entries
Finished 22000 entries
Finished 23000 entries
Finished 24000 entries
Finished 25000 entries
Finished 26000 entries
Finished 27000 entries
Finished 28000 entries
Finished 29000 entries
Finished 30000 entries
Finished 31000 entries
Finished 32000 entries
Finished 33000 entries
Finished 34000 entries
Finished 35000 entries
Finished 36000 entries
Finished 37000 entries
Finished 38000 entries
Finished 39000 entries
Finished 40000 entries
We have processed 40965 entries and written 37853 entries to two fasta-files
In [133]:
!grep -c N hg19_gene_clean.fa
0
In [134]:
!grep -c N hg19_notgene_clean.fa
0
In [135]:
import os
from Bio import SeqIO

os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')

gene_file = 'hg19_gene_clean.fa'
notgene_file = 'hg19_notgene_clean.fa'

a = 0
gene_seqs = []
notgene_seqs = []
for gene, notgene in zip(SeqIO.parse(gene_file, 'fasta'), SeqIO.parse(notgene_file, 'fasta')):
    #gene_seqs.append(str(gene.seq)[0:np.min(gene_lengths)])
    #notgene_seqs.append(str(notgene.seq)[0:np.min(notgene_lengths)])
    s_gene = str(gene.seq)[0:32]
    s_notgene = str(notgene.seq)[0:32]
    if s_gene.count('A')>0 and s_gene.count('C')>0 and s_gene.count('G')>0 and s_gene.count('T')>0 and \
    s_notgene.count('A')>0 and s_notgene.count('C')>0 and s_notgene.count('G')>0 and s_notgene.count('T')>0:
        gene_seqs.append(s_gene)
        notgene_seqs.append(s_notgene)
    a = a + 1
    if a%1000 == 0:
        print('Finished ' + str(a) + ' entries')
Finished 1000 entries
Finished 2000 entries
Finished 3000 entries
Finished 4000 entries
Finished 5000 entries
Finished 6000 entries
Finished 7000 entries
Finished 8000 entries
Finished 9000 entries
Finished 10000 entries
Finished 11000 entries
Finished 12000 entries
Finished 13000 entries
Finished 14000 entries
Finished 15000 entries
Finished 16000 entries
Finished 17000 entries
Finished 18000 entries
Finished 19000 entries
Finished 20000 entries
Finished 21000 entries
Finished 22000 entries
Finished 23000 entries
Finished 24000 entries
Finished 25000 entries
Finished 26000 entries
Finished 27000 entries
Finished 28000 entries
Finished 29000 entries
Finished 30000 entries
Finished 31000 entries
Finished 32000 entries
Finished 33000 entries
Finished 34000 entries
Finished 35000 entries
Finished 36000 entries
Finished 37000 entries
In [136]:
sequences = gene_seqs + notgene_seqs
len(sequences)
Out[136]:
72424
In [137]:
import numpy as np
labels = list(np.ones(len(gene_seqs))) + list(np.zeros(len(notgene_seqs)))
len(labels)
Out[137]:
72424
In [138]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

import warnings
warnings.filterwarnings('ignore')

# The LabelEncoder encodes a sequence of bases as a sequence of integers: 0, 1, 2 and 3
integer_encoder = LabelEncoder()  
# The OneHotEncoder converts the array of integers to a sparse matrix in which
# each column corresponds to one possible base, i.e. only 0 and 1 are present in the matrix
one_hot_encoder = OneHotEncoder()   
input_features = []

for sequence in sequences:
  integer_encoded = integer_encoder.fit_transform(list(sequence))
  integer_encoded = np.array(integer_encoded).reshape(-1, 1)
  one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
  input_features.append(one_hot_encoded.toarray())

np.set_printoptions(threshold = 40)
#print(input_features.shape)
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n',sequences[0][:10],'...',sequences[0][-10:])
print('One hot encoding of Sequence #1:\n',input_features[0].T)
Example sequence
-----------------------
DNA Sequence #1:
 AAAATGTGCT ... TATTGATAAT
One hot encoding of Sequence #1:
 [[1. 1. 1. ... 1. 1. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]]
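Note that the encoders above are re-fit on every sequence, so the channel order stays consistent only because sequences missing one of the four bases were filtered out earlier. With a fixed A/C/G/T vocabulary that restriction disappears; a minimal alternative sketch (not the encoder used in this notebook):

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a (len, 4) 0/1 matrix with fixed A, C, G, T channels."""
    idx = np.array(['ACGT'.index(b) for b in seq.upper()])
    # row i of the identity matrix is the one-hot vector for base index i
    return np.eye(4, dtype=np.float32)[idx]

x = one_hot('ACGTT')
print(x.shape)  # (5, 4)
```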
In [139]:
one_hot_encoder = OneHotEncoder()
labels = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()

print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)
Labels:
 [[1. 1. 1. ... 0. 0. 0.]]
One-hot encoded labels:
 [[0. 0. 0. ... 1. 1. 1.]
 [1. 1. 1. ... 0. 0. 0.]]
In [140]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(
    input_features, input_labels, test_size = 0.2, random_state = 42)
In [141]:
train_features.shape
Out[141]:
(57939, 32, 4)
In [142]:
train_labels.shape
Out[142]:
(57939, 2)
In [143]:
test_features.shape
Out[143]:
(14485, 32, 4)
In [144]:
test_labels.shape
Out[144]:
(14485, 2)
In [145]:
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
from keras.models import Sequential
from keras.regularizers import l1

import warnings
warnings.filterwarnings('ignore')

model = Sequential()

model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform', 
                 input_shape = (train_features.shape[1], 4)))
model.add(Activation("relu"))
model.add(Conv1D(filters = 16, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size = 2))
model.add(Dropout(0.3))

model.add(Flatten())
model.add(Dense(8, kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Dropout(0.5))
model.add(Dense(2, activation = 'softmax'))

epochs = 100
lrate = 0.01
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d_17 (Conv1D)           (None, 32, 16)            336       
_________________________________________________________________
activation_25 (Activation)   (None, 32, 16)            0         
_________________________________________________________________
conv1d_18 (Conv1D)           (None, 32, 16)            1296      
_________________________________________________________________
activation_26 (Activation)   (None, 32, 16)            0         
_________________________________________________________________
max_pooling1d_9 (MaxPooling1 (None, 16, 16)            0         
_________________________________________________________________
dropout_5 (Dropout)          (None, 16, 16)            0         
_________________________________________________________________
flatten_9 (Flatten)          (None, 256)               0         
_________________________________________________________________
dense_17 (Dense)             (None, 8)                 2056      
_________________________________________________________________
activation_27 (Activation)   (None, 8)                 0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 8)                 0         
_________________________________________________________________
dense_18 (Dense)             (None, 2)                 18        
=================================================================
Total params: 3,706
Trainable params: 3,706
Non-trainable params: 0
_________________________________________________________________
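The parameter counts in the summary are easy to verify by hand: a `Conv1D` layer holds `kernel_size * in_channels * filters` weights plus one bias per filter, and a `Dense` layer `in_features * out_features` weights plus one bias per output unit. A quick check reproducing the table:

```python
def conv1d_params(kernel_size, in_channels, filters):
    # each filter covers kernel_size * in_channels inputs, plus one bias per filter
    return kernel_size * in_channels * filters + filters

def dense_params(in_features, out_features):
    return in_features * out_features + out_features

p1 = conv1d_params(5, 4, 16)    # first Conv1D on the 4-channel one-hot input
p2 = conv1d_params(5, 16, 16)   # second Conv1D
p3 = dense_params(16 * 16, 8)   # Flatten: 16 positions after pooling x 16 filters = 256
p4 = dense_params(8, 2)         # softmax output layer
print(p1, p2, p3, p4, p1 + p2 + p3 + p4)  # 336 1296 2056 18 3706
```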
In [146]:
import warnings
warnings.filterwarnings('ignore')

history = model.fit(train_features, train_labels, 
                    epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 32, shuffle = True)
Train on 46351 samples, validate on 11588 samples
Epoch 1/100
46351/46351 [==============================] - 5s 102us/step - loss: 0.6907 - binary_accuracy: 0.5232 - val_loss: 0.6490 - val_binary_accuracy: 0.6607
Epoch 2/100
46351/46351 [==============================] - 4s 89us/step - loss: 0.6364 - binary_accuracy: 0.6278 - val_loss: 0.5689 - val_binary_accuracy: 0.7126
Epoch 3/100
46351/46351 [==============================] - 4s 87us/step - loss: 0.5789 - binary_accuracy: 0.6977 - val_loss: 0.5137 - val_binary_accuracy: 0.7563
Epoch 4/100
46351/46351 [==============================] - 4s 87us/step - loss: 0.5606 - binary_accuracy: 0.7147 - val_loss: 0.5096 - val_binary_accuracy: 0.7584
Epoch 5/100
46351/46351 [==============================] - 4s 87us/step - loss: 0.5509 - binary_accuracy: 0.7285 - val_loss: 0.5044 - val_binary_accuracy: 0.7545
Epoch 6/100
46351/46351 [==============================] - 4s 87us/step - loss: 0.5418 - binary_accuracy: 0.7337 - val_loss: 0.4975 - val_binary_accuracy: 0.7649
Epoch 7/100
46351/46351 [==============================] - 4s 87us/step - loss: 0.5369 - binary_accuracy: 0.7381 - val_loss: 0.5011 - val_binary_accuracy: 0.7616
Epoch 8/100
46351/46351 [==============================] - 4s 89us/step - loss: 0.5294 - binary_accuracy: 0.7380 - val_loss: 0.4886 - val_binary_accuracy: 0.7629
Epoch 9/100
46351/46351 [==============================] - 4s 88us/step - loss: 0.5293 - binary_accuracy: 0.7384 - val_loss: 0.4835 - val_binary_accuracy: 0.7670
Epoch 10/100
46351/46351 [==============================] - 4s 89us/step - loss: 0.5313 - binary_accuracy: 0.7413 - val_loss: 0.4842 - val_binary_accuracy: 0.7665
Epoch 11/100
46351/46351 [==============================] - 4s 88us/step - loss: 0.5268 - binary_accuracy: 0.7477 - val_loss: 0.4844 - val_binary_accuracy: 0.7682
Epoch 12/100
46351/46351 [==============================] - 4s 88us/step - loss: 0.5261 - binary_accuracy: 0.7470 - val_loss: 0.4870 - val_binary_accuracy: 0.7665
Epoch 13/100
46351/46351 [==============================] - 4s 90us/step - loss: 0.5220 - binary_accuracy: 0.7481 - val_loss: 0.4829 - val_binary_accuracy: 0.7662
Epoch 14/100
46351/46351 [==============================] - 4s 90us/step - loss: 0.5215 - binary_accuracy: 0.7446 - val_loss: 0.4806 - val_binary_accuracy: 0.7692
Epoch 15/100
46351/46351 [==============================] - 4s 90us/step - loss: 0.5209 - binary_accuracy: 0.7512 - val_loss: 0.4795 - val_binary_accuracy: 0.7702
Epoch 16/100
46351/46351 [==============================] - 4s 90us/step - loss: 0.5177 - binary_accuracy: 0.7487 - val_loss: 0.4832 - val_binary_accuracy: 0.7703
Epoch 17/100
46351/46351 [==============================] - 4s 92us/step - loss: 0.5193 - binary_accuracy: 0.7487 - val_loss: 0.4786 - val_binary_accuracy: 0.7698
Epoch 18/100
46351/46351 [==============================] - 4s 91us/step - loss: 0.5187 - binary_accuracy: 0.7501 - val_loss: 0.4821 - val_binary_accuracy: 0.7688
Epoch 19/100
46351/46351 [==============================] - 4s 90us/step - loss: 0.5164 - binary_accuracy: 0.7512 - val_loss: 0.4786 - val_binary_accuracy: 0.7704
Epoch 20/100
46351/46351 [==============================] - 4s 91us/step - loss: 0.5156 - binary_accuracy: 0.7545 - val_loss: 0.4787 - val_binary_accuracy: 0.7726
Epoch 21/100
46351/46351 [==============================] - 4s 90us/step - loss: 0.5172 - binary_accuracy: 0.7528 - val_loss: 0.4779 - val_binary_accuracy: 0.7710
Epoch 22/100
46351/46351 [==============================] - 4s 89us/step - loss: 0.5147 - binary_accuracy: 0.7532 - val_loss: 0.4773 - val_binary_accuracy: 0.7694
Epoch 23/100
46351/46351 [==============================] - 4s 91us/step - loss: 0.5131 - binary_accuracy: 0.7537 - val_loss: 0.4744 - val_binary_accuracy: 0.7712
Epoch 24/100
46351/46351 [==============================] - 4s 90us/step - loss: 0.5164 - binary_accuracy: 0.7525 - val_loss: 0.4780 - val_binary_accuracy: 0.7722
Epoch 25/100
46351/46351 [==============================] - 5s 105us/step - loss: 0.5131 - binary_accuracy: 0.7529 - val_loss: 0.4749 - val_binary_accuracy: 0.7720
Epoch 26/100
46351/46351 [==============================] - 5s 102us/step - loss: 0.5123 - binary_accuracy: 0.7563 - val_loss: 0.4763 - val_binary_accuracy: 0.7718
Epoch 27/100
46351/46351 [==============================] - 4s 92us/step - loss: 0.5145 - binary_accuracy: 0.7512 - val_loss: 0.4742 - val_binary_accuracy: 0.7733
Epoch 28/100
46351/46351 [==============================] - 4s 90us/step - loss: 0.5154 - binary_accuracy: 0.7498 - val_loss: 0.4775 - val_binary_accuracy: 0.7740
Epoch 29/100
46351/46351 [==============================] - 4s 89us/step - loss: 0.5130 - binary_accuracy: 0.7556 - val_loss: 0.4742 - val_binary_accuracy: 0.7721
Epoch 30/100
46351/46351 [==============================] - 4s 91us/step - loss: 0.5130 - binary_accuracy: 0.7543 - val_loss: 0.4743 - val_binary_accuracy: 0.7748
Epoch 31/100
46351/46351 [==============================] - 4s 90us/step - loss: 0.5105 - binary_accuracy: 0.7538 - val_loss: 0.4744 - val_binary_accuracy: 0.7730
Epoch 32/100
46351/46351 [==============================] - 5s 115us/step - loss: 0.5127 - binary_accuracy: 0.7533 - val_loss: 0.4741 - val_binary_accuracy: 0.7753
... (epochs 33-98 omitted: training loss drifts from ~0.51 to ~0.50 and validation accuracy plateaus around 0.77-0.78) ...
Epoch 99/100
46351/46351 [==============================] - 6s 138us/step - loss: 0.4981 - binary_accuracy: 0.7710 - val_loss: 0.4661 - val_binary_accuracy: 0.7805
Epoch 100/100
46351/46351 [==============================] - 5s 101us/step - loss: 0.4989 - binary_accuracy: 0.7713 - val_loss: 0.4675 - val_binary_accuracy: 0.7780
In [147]:
import matplotlib.pyplot as plt

plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
In [148]:
from sklearn.metrics import confusion_matrix
import itertools

plt.figure(figsize=(15,10))

predicted_labels = model.predict(np.stack(test_features))
cm = confusion_matrix(np.argmax(test_labels, axis=1), 
                      np.argmax(predicted_labels, axis=1))
print('Confusion matrix:\n',cm)

cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]

plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
#plt.xticks([0, 1]); plt.yticks([0, 1])
#plt.grid('off')
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], '.2f'),
             horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
             color='white' if cm[i, j] > 0.5 else 'black')
plt.show()
Confusion matrix:
 [[5848 1353]
 [1859 5425]]
In [149]:
scores = model.evaluate(test_features, test_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Accuracy: 77.83%
In [150]:
import keras.backend as K

def compute_salient_bases(model, x):
    input_tensors = [model.input]
    gradients = model.optimizer.get_gradients(model.output[0][1], model.input)
    compute_gradients = K.function(inputs = input_tensors, outputs = gradients)
    
    x_value = np.expand_dims(x, axis=0)
    gradients = compute_gradients([x_value])[0][0]
    sal = np.clip(np.sum(np.multiply(gradients,x), axis=1),a_min=0, a_max=None)
    return sal
In [159]:
sequence_index = 12
K.set_learning_phase(1) #set learning phase
sal = compute_salient_bases(model, input_features[sequence_index])

plt.figure(figsize=[16,5])
barlist = plt.bar(np.arange(len(sal)), sal)
for i in range(0, 6):
    barlist[i].set_color('C1')
for j in range(26, 32):
    barlist[j].set_color('C1')
plt.xlabel('Bases')
plt.ylabel('Magnitude of saliency values')
plt.xticks(np.arange(len(sal)), list(sequences[sequence_index]));
plt.title('Saliency map for bases in one of the ancient sequences'
          ' (orange indicates the informative bases in motif)');
plt.show()

This looks much better! The model can distinguish between segments that come from genes and segments that do not with close to 80% accuracy, which could probably be improved further by considering longer stretches of DNA. So one reason this kind of classification did not work for Neanderthal introgressed vs. depleted sequences may be the poor calling / annotation provided by Vernot and Akey in their Science 2014 paper. As evidence that something is wrong with the introgressed sequence calls from Vernot and Akey 2014, we can calculate how many times their sequences overlap the RefSeq genes:
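The pairwise overlap count that `bedtools intersect ... | wc -l` reports can be sanity-checked with a minimal pure-Python sketch. This is a quadratic toy loop, not a replacement for bedtools, and the interval lists below are made up for illustration:

```python
# Minimal sanity check of what `bedtools intersect -a A -b B | wc -l` counts:
# the number of (A, B) interval pairs that share at least one base.
# Intervals are (chrom, start, end) tuples in BED-style half-open coordinates.
def count_overlaps(queries, targets):
    """Count query/target pairs on the same chromosome that overlap."""
    count = 0
    for qc, qs, qe in queries:
        for tc, ts, te in targets:
            if qc == tc and qs < te and ts < qe:
                count += 1
    return count

# Toy data: two "genes" and two "introgressed haplotypes".
genes = [('chr1', 100, 200), ('chr1', 500, 600)]
introgressed = [('chr1', 150, 550), ('chr2', 0, 100)]
print(count_overlaps(genes, introgressed))  # -> 2: the chr1 haplotype hits both genes
```

For the real data the interval lists have tens of thousands of entries, so the interval-tree approach bedtools uses is the right tool; the toy version only illustrates what is being counted.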

In [15]:
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/1000G_Akey_NeandIntrogression
bedtools intersect -a ./genes/gene_coords.txt -b $Path/all_haplotypes_populations.bed.all | wc -l
15590

and now let us compare this number with how many times their sequences overlap with intergenic regions:

In [16]:
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/1000G_Akey_NeandIntrogression
bedtools intersect -a ./genes/notgene_coords.txt -b $Path/all_haplotypes_populations.bed.all | wc -l
14958

Here we see no clear enrichment of the sequences in either gene or intergenic regions. If anything, the Neanderthal introgressed sequences look slightly enriched in the gene regions, which contradicts one of the main conclusions of David Reich, Nature 2014, where they explicitly state that Neanderthal sequences are depleted in gene regions; in an evolutionary context this can be interpreted as natural selection acting against the admixture with Neanderthals for the sake of fitness. Now we will try to use the better annotation from Vernot and Akey, Science 2016, and the annotation from David Reich, Nature 2016, and redo the Neanderthal introgressed vs. depleted sequence classification.

Better Annotation of Neanderthal Introgressed Haplotypes

We downloaded the merged maps of Neanderthal introgression for the European + Asian 1000G populations published in Vernot and Akey, Science 2014, from https://akeylab.princeton.edu/downloads.html. However, the data looked strange (the segments were too different from the ones in Reich, Nature 2014) and we could not build any good model distinguishing between Neanderthal introgressed and depleted regions. Next we downloaded the recall data from Vernot et al., Science 2016, from https://drive.google.com/drive/folders/0B9Pc7_zItMCVWUp6bWtXc2xJVkk. Notably, with this data set we obtain the opposite result when intersecting the introgressed regions with gene and non-gene regions.

In [21]:
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a ./genes/gene_coords.txt -b $Path/Akey_intr_coords.bed | wc -l
178302
In [22]:
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a ./genes/notgene_coords.txt -b $Path/Akey_intr_coords.bed | wc -l
188770

Here it is clear that the Neanderthal introgression regions overlap more often with the intergenic regions, i.e. the genes are depleted of Neanderthal introgression, pointing to evolutionary losses in fitness from the inbreeding of modern humans with Neanderthals. We see the same effect with the Neanderthal introgressed regions from David Reich, 2016, after selecting the most confident regions, i.e. those with probability >= 90%. Note that Reich and colleagues provide candidate regions with probability >= 50%, but to be on the safe side we keep only the most confident intervals.
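As a hypothetical sketch of how such a high-confidence subset could be produced: assume a BED-like table whose fourth column holds the introgression probability (the actual column layout of the Reich call set may differ, and the `prob` column name and output file name here are assumptions for illustration):

```python
import pandas as pd

# Toy candidate call set: three regions with an assumed probability column.
calls = pd.DataFrame({
    'chrom': ['chr1', 'chr1', 'chr2'],
    'start': [1000, 5000, 2000],
    'end':   [2000, 6000, 3000],
    'prob':  [0.55, 0.93, 0.97],
})

# Keep only the most confident intervals (probability >= 90%).
high_conf = calls[calls['prob'] >= 0.9]

# Write a plain 3-column BED file, the format bedtools intersect expects.
high_conf[['chrom', 'start', 'end']].to_csv(
    'Reich_intr_coords_high_conf.bed', sep='\t', header=False, index=False)
print(len(high_conf))  # -> 2: two of the three candidates pass the cutoff
```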

In [19]:
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a ./genes/gene_coords.txt -b $Path/Reich_intr_coords_high_conf.bed | wc -l
72394
In [20]:
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a ./genes/notgene_coords.txt -b $Path/Reich_intr_coords_high_conf.bed | wc -l
73045

We can also do something more intelligent and compute an empirical histogram of intersects with randomly constructed non-gene regions. In other words, we generate the notgene_coords.txt file multiple times and run bedtools to count the number of intersects with the Neanderthal introgressed regions from Vernot and Akey, Science 2016; in this way we obtain an empirical histogram. The goal is to show that the observed number of gene intersects, 178302, is extreme, i.e. located in the 5% tail of the histogram.
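Once the permutation counts are collected, the tail test itself is a one-liner. A minimal sketch (toy permutation counts drawn from a normal distribution stand in for the real `perm_n` values; the `+1` smoothing is one common convention for empirical p-values):

```python
import numpy as np

def empirical_pvalue(observed, permuted):
    """One-sided empirical p-value: fraction of permutation counts that are
    as low as the observed value, with +1 smoothing so the estimate is
    never exactly zero."""
    permuted = np.asarray(permuted)
    return (np.sum(permuted <= observed) + 1) / (len(permuted) + 1)

# Toy stand-in for perm_n: 100 counts around 186000, like the histogram below.
rng = np.random.default_rng(0)
toy_perm = rng.normal(186000, 2500, size=100)

# The observed gene-intersect count sits deep in the left tail.
print(empirical_pvalue(178302, toy_perm))
```

A p-value below 0.05 here corresponds to the observed count lying in the 5% left tail of the histogram, which is exactly the criterion stated above.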

In [23]:
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header = None, sep = "\t")
chr_sizes.head()
Out[23]:
0 1 2 3 4
0 chr1 249250621 6 50 51
1 chr2 243199373 254235646 50 51
2 chr3 198022430 502299013 50 51
3 chr4 191154276 704281898 50 51
4 chr5 180915260 899259266 50 51
In [29]:
import pandas as pd
Path = '/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/genes/'
gene_coords = pd.read_csv(Path + 'gene_coords.txt', header=None, sep="\t")
gene_lengths = gene_coords.iloc[:, 2]-gene_coords.iloc[:, 1]
gene_coords.head()
Out[29]:
0 1 2
0 chr4 419224 467918
1 chr2 54197975 54532437
2 chr1 154554538 154600475
3 chr7 45613739 45762715
4 chr10 111756126 111895323
In [116]:
import os
import numpy as np
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')

#perm_n = []  # initialize once before the first run; left commented out so counts accumulate across repeated runs of this cell
for k in range(100):
    chr_list = []
    start_list = []
    end_list = []
    for i in range(gene_coords.shape[0]):
        chr_df = gene_coords[gene_coords[0] == gene_coords[0][i]]
        b1 = chr_df[1][i]
        b2 = chr_df[2][i]
        overlap = True
        while overlap == True:
            reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == gene_coords[0][i]].iloc[:,1]))
            reg_end = reg_start + gene_lengths[i]
            for j in range(chr_df.shape[0]):
                if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2):
                    overlap = True
                    break
                else:
                    overlap = False
        chr_list.append(gene_coords[0][i])
        start_list.append(reg_start)
        end_list.append(reg_end)
    notgene_rand_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
    notgene_rand_coords.to_csv("temp.txt", index = False, header = False, sep = "\t")

    akey_path = '/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/'
    with open('n.txt', 'w') as fp:
        subprocess.run(['bedtools', 'intersect', '-a', 'temp.txt', '-b', 
                        akey_path + 'Akey_intr_coords.bed'], stdout = fp)
    akey_n = pd.read_csv('n.txt', header=None, sep="\t")
    print(k, akey_n.shape[0])
    perm_n.append(akey_n.shape[0])
0 184647
1 183123
2 185392
... (iterations 3-97 omitted; all 100 counts fall between 180461 and 194065) ...
98 190276
99 186758
In [117]:
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a temp.txt -b $Path/Akey_intr_coords.bed | wc -l
186758
In [118]:
len(perm_n)
Out[118]:
300
In [120]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.xlim([170000,200000])
plt.axvline(x=178302,linewidth=4, color='r')
sns.distplot(perm_n)
plt.title("Distribution of Non-Gene Intersects: Vernot and Akey, Science 2016", fontsize = 20)
plt.xlabel("Number of Intersects Between Neanderthal Introgressed and Non-Gene Regions", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()

The red vertical line marks the number of intersects between the Neanderthal introgressed regions and the gene regions. The blue histogram shows the numbers of intersects between the Neanderthal introgressed regions and randomly drawn intergenic regions; in each draw the intergenic regions have exactly the same lengths as the gene regions, and we generated exactly as many intergenic regions as there are gene regions. What we see from this figure is that the Neanderthal introgressed regions are significantly enriched in the intergenic rather than the gene regions. In other words, evolution apparently did not keep Neanderthal sequences within genes, presumably because of some loss in fitness, but instead purged the Neanderthal sequences from the genes of modern humans.

These were the Neanderthal introgression coordinates from Vernot and Akey, Science 2016. Let us now use the Neanderthal introgression coordinates from David Reich and colleagues, Nature 2016, and check whether we observe the same type of enrichment. Again, the hypothesis is that the observed number of 72394 intersects between Reich's Neanderthal introgression regions and the gene regions is significantly lower than the typical number of intersects with the intergenic segments.

In [145]:
import os
import numpy as np
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')

#perm_n_reich = []  # initialize once before the first run; left commented out so counts accumulate across repeated runs of this cell
for k in range(160):
    chr_list = []
    start_list = []
    end_list = []
    for i in range(gene_coords.shape[0]):
        chr_df = gene_coords[gene_coords[0] == gene_coords[0][i]]
        b1 = chr_df[1][i]
        b2 = chr_df[2][i]
        overlap = True
        while overlap == True:
            reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == gene_coords[0][i]].iloc[:,1]))
            reg_end = reg_start + gene_lengths[i]
            for j in range(chr_df.shape[0]):
                if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2):
                    overlap = True
                    break
                else:
                    overlap = False
        chr_list.append(gene_coords[0][i])
        start_list.append(reg_start)
        end_list.append(reg_end)
    notgene_rand_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
    notgene_rand_coords.to_csv("temp.txt", index = False, header = False, sep = "\t")

    reich_path = '/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/'
    with open('n.txt', 'w') as fp:
        subprocess.run(['bedtools', 'intersect', '-a', 'temp.txt', '-b', 
                        reich_path + 'Reich_intr_coords_high_conf.bed'], stdout = fp)
    reich_n = pd.read_csv('n.txt', header=None, sep="\t")
    print(k, reich_n.shape[0])
    perm_n_reich.append(reich_n.shape[0])
0 72912
1 71643
2 73139
... (iterations 3-157 omitted; all 160 counts fall between 71167 and 75692) ...
158 74366
159 73491
In [146]:
%%bash
Path=/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression
bedtools intersect -a temp.txt -b $Path/Reich_intr_coords_high_conf.bed | wc -l
73491
In [147]:
len(perm_n_reich)
Out[147]:
300
In [148]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
plt.xlim([69000,78000])
plt.axvline(x = 72394,linewidth = 4, color = 'r')
sns.distplot(perm_n_reich)
plt.title("Distribution of Non-Gene Intersects: Reich, Nature 2016", fontsize = 20)
plt.xlabel("Number of Intersects Between Neanderthal Introgressed and Non-Gene Regions", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()

We can see that the Reich introgression segments do not really look enriched in the intergenic regions. This is strange; perhaps one could raise the confidence threshold for Neanderthal introgression from 90% to 99%. For now, we will continue with Akey's segments.

In [149]:
import pandas as pd
Path = '/home/nikolay/WABI/Misc/AncientDNA/NeandIntrogression/'
intr_coords = pd.read_csv(Path + 'Akey_intr_coords.bed', header = None, sep = "\t")
intr_coords.head()
Out[149]:
0 1 2
0 chr1 2903159 2915884
1 chr1 2932446 2972497
2 chr1 2960608 2996556
3 chr1 2960608 2999518
4 chr1 2960608 3001253
In [150]:
intr_coords.shape
Out[150]:
(83601, 3)
In [151]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
intr_lengths = intr_coords.iloc[:, 2]-intr_coords.iloc[:, 1]
sns.distplot(intr_lengths)
plt.title("Distribution of Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.xlabel("Lengths of Neandertal Introgressed Regions", fontsize = 20)
plt.ylabel("Frequency", fontsize = 20)
plt.show()
In [152]:
from scipy import stats
print(stats.describe(intr_lengths))
DescribeResult(nobs=83601, minmax=(10002, 1194940), mean=92137.5803877944, variance=6139175379.516866, skewness=2.7095069302166377, kurtosis=11.38558261567325)
In [153]:
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
a = 0
with open('hg19_introgr_regions.fa', 'a') as fp:
    for i in range(intr_coords.shape[0]):
        coord = str(str(intr_coords.iloc[i, 0]) + ':' 
                    + str(intr_coords.iloc[i, 1]) + '-' + str(intr_coords.iloc[i, 2]))
        subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
        a = a + 1
        if a%10000 == 0:
            print('Finished ' + str(a) + ' Neanderthal introgressed haplotypes')
Finished 10000 Neanderthal introgressed haplotypes
Finished 20000 Neanderthal introgressed haplotypes
Finished 30000 Neanderthal introgressed haplotypes
Finished 40000 Neanderthal introgressed haplotypes
Finished 50000 Neanderthal introgressed haplotypes
Finished 60000 Neanderthal introgressed haplotypes
Finished 70000 Neanderthal introgressed haplotypes
Finished 80000 Neanderthal introgressed haplotypes
In [154]:
chr_sizes = pd.read_csv("hg19.fa.gz.fai", header=None, sep="\t")
chr_sizes.head()
Out[154]:
0 1 2 3 4
0 chr1 249250621 6 50 51
1 chr2 243199373 254235646 50 51
2 chr3 198022430 502299013 50 51
3 chr4 191154276 704281898 50 51
4 chr5 180915260 899259266 50 51
In [155]:
chr_list = []
start_list = []
end_list = []
for i in range(intr_coords.shape[0]):
    chr_df = intr_coords[intr_coords[0] == intr_coords[0][i]]
    b1 = chr_df[1][i]
    b2 = chr_df[2][i]
    overlap = True
    while overlap == True:
        reg_start = np.random.randint(1, int(chr_sizes[chr_sizes[0] == intr_coords[0][i]].iloc[:,1]))
        reg_end = reg_start + intr_lengths[i]
        for j in range(chr_df.shape[0]):
            if (reg_start > b1 and reg_start < b2) or (reg_end > b1 and reg_end < b2):
                overlap = True
                break
            else:
                overlap = False
    chr_list.append(intr_coords[0][i])
    start_list.append(reg_start)
    end_list.append(reg_end)
depl_coords = pd.DataFrame({'0': chr_list, '1': start_list, '2': end_list})
depl_coords.head()
Out[155]:
0 1 2
0 chr1 207924105 207936830
1 chr1 176419320 176459371
2 chr1 95219023 95254971
3 chr1 188361584 188400494
4 chr1 98447184 98487829
In [156]:
import os
import subprocess
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')
with open('hg19_depl_regions.fa', 'a') as fp:
    for i in range(depl_coords.shape[0]):
        coord = str(str(depl_coords.iloc[i, 0]) + ':' 
                    + str(depl_coords.iloc[i, 1]) + '-' + str(depl_coords.iloc[i, 2]))
        subprocess.run(['samtools', 'faidx', 'hg19.fa.gz', str(coord)], stdout = fp)
In [157]:
!grep -c N hg19_introgr_regions.fa
863
In [158]:
!grep -c N hg19_depl_regions.fa
8195300
In [159]:
from Bio import SeqIO
i = 0
for record in SeqIO.parse('hg19_introgr_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        i = i + 1
print(i)

i = 0
for record in SeqIO.parse('hg19_depl_regions.fa', 'fasta'):
    upper_record = record.seq.upper()
    if 'N' in upper_record:
        i = i + 1
print(i)
54
5941
In [160]:
from Bio import SeqIO

intr_file = 'hg19_introgr_regions.fa'
depl_file = 'hg19_depl_regions.fa'
a = 0
i = 0
with open('hg19_introgr_clean.fa', 'a') as intr_out, open('hg19_depl_clean.fa', 'a') as depl_out:
    for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
        upper_intr = intr.seq.upper()
        upper_depl = depl.seq.upper()
        a = a + 1
        if a%1000 == 0:
            print('Finished ' + str(a) + ' entries')
        if 'N' not in str(upper_intr) and 'N' not in str(upper_depl):
            intr.seq = upper_intr
            SeqIO.write(intr, intr_out, 'fasta')
            depl.seq = upper_depl
            SeqIO.write(depl, depl_out, 'fasta')
            i = i + 1
        else:
            continue
print('We have processed ' + str(a) + ' entries and written ' + str(i) + ' entries to two fasta-files')
Finished 1000 entries
Finished 2000 entries
... (progress lines for 3000-81000 entries omitted) ...
Finished 82000 entries
Finished 83000 entries
We have processed 83601 entries and written 77608 entries to two fasta-files
In [161]:
!grep -c N hg19_introgr_clean.fa
0
In [162]:
!grep -c N hg19_depl_clean.fa
0

Now everything is ready for classifying Neanderthal-introgressed vs. depleted regions. We will read the two clean fasta-files with the sequences of introgressed and depleted regions, convert them to a one-hot encoded representation, and train a Convolutional Neural Network (CNN) classifier.

In [1]:
import os
from Bio import SeqIO

os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')

intr_file = 'hg19_introgr_clean.fa'
depl_file = 'hg19_depl_clean.fa'

a = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    # Truncate each sequence to its first 100 bases and keep the pair only if
    # both sequences contain all four bases (needed for consistent one-hot encoding)
    s_intr = str(intr.seq)[0:100]
    s_depl = str(depl.seq)[0:100]
    if s_intr.count('A')>0 and s_intr.count('C')>0 and s_intr.count('G')>0 and s_intr.count('T')>0 and \
    s_depl.count('A')>0 and s_depl.count('C')>0 and s_depl.count('G')>0 and s_depl.count('T')>0:
        intr_seqs.append(s_intr)
        depl_seqs.append(s_depl)
    a = a + 1
    if a%10000 == 0:
        print('Finished ' + str(a) + ' entries')
Finished 10000 entries
Finished 20000 entries
Finished 30000 entries
Finished 40000 entries
Finished 50000 entries
Finished 60000 entries
Finished 70000 entries
In [2]:
sequences = intr_seqs + depl_seqs
len(sequences)
Out[2]:
155062
In [3]:
import numpy as np
labels = list(np.ones(len(intr_seqs))) + list(np.zeros(len(depl_seqs)))
len(labels)
Out[3]:
155062
In [4]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

import warnings
warnings.filterwarnings('ignore')

# The LabelEncoder encodes a sequence of bases as a sequence of integers: 0, 1, 2 and 3
integer_encoder = LabelEncoder()  
# The OneHotEncoder converts the integer array to a sparse binary matrix where
# each column corresponds to one of the four bases, i.e. only 0 and 1 are present in the matrix
one_hot_encoder = OneHotEncoder()   
input_features = []

for sequence in sequences:
  # fit_transform is re-fitted on every sequence; the resulting encoding is
  # consistent across sequences only because each one was filtered above to
  # contain all four bases A, C, G and T
  integer_encoded = integer_encoder.fit_transform(list(sequence))
  integer_encoded = np.array(integer_encoded).reshape(-1, 1)
  one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)
  input_features.append(one_hot_encoded.toarray())

np.set_printoptions(threshold = 40)
#print(input_features.shape)
input_features = np.stack(input_features)
print("Example sequence\n-----------------------")
print('DNA Sequence #1:\n',sequences[0][:10],'...',sequences[0][-10:])
print('One hot encoding of Sequence #1:\n',input_features[0].T)
Example sequence
-----------------------
DNA Sequence #1:
 AATGACATTA ... GCCTCAGCTG
One hot encoding of Sequence #1:
 [[1. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 1. ... 0. 1. 0.]]
In [5]:
input_features.shape
Out[5]:
(155062, 100, 4)
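As an aside, the same (100, 4) encoding can be produced with a fixed base-to-column mapping instead of re-fitting a `LabelEncoder` per sequence; a minimal sketch, assuming sequences contain only A, C, G and T (the names `BASE_TO_INDEX` and `one_hot_encode` are illustrative, not part of the pipeline above):

```python
import numpy as np

# Fixed, deterministic base -> column mapping (assumption: no ambiguous bases)
BASE_TO_INDEX = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

def one_hot_encode(sequence):
    """Encode a DNA string as a (len(sequence), 4) binary matrix."""
    encoded = np.zeros((len(sequence), 4))
    for position, base in enumerate(sequence):
        encoded[position, BASE_TO_INDEX[base]] = 1.0
    return encoded

print(one_hot_encode('ACGT'))  # 4x4 identity matrix: one base per column
```

A fixed mapping guarantees that column 0 always means A, column 1 always means C, and so on, regardless of the base composition of an individual sequence.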
In [6]:
one_hot_encoder = OneHotEncoder()
labels = np.array(labels).reshape(-1, 1)
input_labels = one_hot_encoder.fit_transform(labels).toarray()

print('Labels:\n',labels.T)
print('One-hot encoded labels:\n',input_labels.T)
Labels:
 [[1. 1. 1. ... 0. 0. 0.]]
One-hot encoded labels:
 [[0. 0. 0. ... 1. 1. 1.]
 [1. 1. 1. ... 0. 0. 0.]]
In [7]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(
    input_features, input_labels, test_size = 0.2, random_state = 42)
In [8]:
train_features.shape
Out[8]:
(124049, 100, 4)
In [9]:
train_labels.shape
Out[9]:
(124049, 2)
In [10]:
test_features.shape
Out[10]:
(31013, 100, 4)
In [11]:
test_labels.shape
Out[11]:
(31013, 2)
In [13]:
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
from keras.models import Sequential
from keras.regularizers import l1

import warnings
warnings.filterwarnings('ignore')

model = Sequential()

model.add(Conv1D(filters = 64, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform', 
                 input_shape = (train_features.shape[1], 4)))
model.add(Activation("relu"))
model.add(Conv1D(filters = 64, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size = 2))
model.add(Dropout(0.1))

model.add(Flatten())
model.add(Dense(16, kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Dropout(0.2))
model.add(Dense(2, activation = 'softmax'))

epochs = 100
lrate = 0.01
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1d_3 (Conv1D)            (None, 100, 64)           1344      
_________________________________________________________________
activation_4 (Activation)    (None, 100, 64)           0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 100, 64)           20544     
_________________________________________________________________
activation_5 (Activation)    (None, 100, 64)           0         
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 50, 64)            0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 50, 64)            0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 3200)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 16)                51216     
_________________________________________________________________
activation_6 (Activation)    (None, 16)                0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 34        
=================================================================
Total params: 73,138
Trainable params: 73,138
Non-trainable params: 0
_________________________________________________________________
In [14]:
import warnings
warnings.filterwarnings('ignore')

history = model.fit(train_features, train_labels, 
                    epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 128, shuffle = True)
Train on 99239 samples, validate on 24810 samples
Epoch 1/100
99239/99239 [==============================] - 32s 325us/step - loss: 0.6936 - binary_accuracy: 0.5016 - val_loss: 0.6932 - val_binary_accuracy: 0.4966
Epoch 2/100
99239/99239 [==============================] - 33s 334us/step - loss: 0.6929 - binary_accuracy: 0.5077 - val_loss: 0.6921 - val_binary_accuracy: 0.5140
...
Epoch 99/100
99239/99239 [==============================] - 33s 331us/step - loss: 0.3655 - binary_accuracy: 0.8197 - val_loss: 0.6897 - val_binary_accuracy: 0.6900
Epoch 100/100
99239/99239 [==============================] - 33s 330us/step - loss: 0.3641 - binary_accuracy: 0.8201 - val_loss: 0.6791 - val_binary_accuracy: 0.6944
In [16]:
import matplotlib.pyplot as plt

plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
In [17]:
from sklearn.metrics import confusion_matrix
import itertools

plt.figure(figsize=(15,10))

predicted_labels = model.predict(np.stack(test_features))
cm = confusion_matrix(np.argmax(test_labels, axis=1), 
                      np.argmax(predicted_labels, axis=1))
print('Confusion matrix:\n',cm)

cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]

plt.imshow(cm, cmap = plt.cm.Blues)
plt.title('Normalized confusion matrix', fontsize = 20)
plt.colorbar()
plt.xlabel('Predicted label', fontsize = 20)
plt.ylabel('True label', fontsize = 20)
#plt.xticks([0, 1]); plt.yticks([0, 1])
#plt.grid('off')
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
    plt.text(j, i, format(cm[i, j], '.2f'),
             horizontalalignment = 'center', verticalalignment = 'center', fontsize = 20,
             color='white' if cm[i, j] > 0.5 else 'black')
plt.show()
Confusion matrix:
 [[ 9700  5954]
 [ 3703 11656]]
In [18]:
scores = model.evaluate(test_features, test_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Accuracy: 68.86%
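The confusion matrix above also lets us derive precision and recall by hand, which is more informative than accuracy alone for this kind of binary classification; a small sketch using the counts printed above (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix reported above: row/column 0 = depleted, 1 = introgressed
cm = np.array([[9700,  5954],
               [3703, 11656]])

# Overall accuracy: correctly classified regions over all regions
accuracy = np.trace(cm) / cm.sum()
# Precision: of all regions predicted as introgressed, how many truly are
precision_introgressed = cm[1, 1] / cm[:, 1].sum()
# Recall (sensitivity): of all truly introgressed regions, how many were found
recall_introgressed = cm[1, 1] / cm[1, :].sum()

print('Accuracy:  %.4f' % accuracy)
print('Precision: %.4f' % precision_introgressed)
print('Recall:    %.4f' % recall_introgressed)
```

The accuracy of about 0.69 matches the `model.evaluate` output, while recall for the introgressed class is somewhat higher than precision, i.e. the model finds most introgressed regions at the cost of some false positives.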

The accuracy looks better now but is still far from perfect. Perhaps some tricks from Natural Language Processing (NLP), such as word embeddings, might help increase the classification accuracy.
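A common way to bring NLP machinery to DNA is to split each sequence into overlapping k-mers and treat those k-mers as "words", so that a sequence becomes a "sentence" that can be fed to Word2Vec or a Keras Embedding layer. A minimal sketch (the `kmerize` function and the choice k = 3 are illustrative, not part of the pipeline above):

```python
def kmerize(sequence, k=3):
    """Split a DNA string into overlapping k-mers ('words' for NLP models)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Each 100-base sequence becomes a 98-word 'sentence' for k = 3
print(kmerize('ACGTAC'))  # → ['ACG', 'CGT', 'GTA', 'TAC']
```

With k = 3 the vocabulary has at most 4^3 = 64 distinct "words", which keeps a subsequent embedding layer small.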

Word Embedding Experiments

Here we will try to include Word Embedding layers in the Convolutional Neural Network (CNN) in order to reduce dimensionality, increase classification accuracy, and visualize the low-dimensional word embeddings.

In [113]:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding

docs = ['well done', 'good work', 'great effort', 'nice work', 'excellent', 'weak', 'poor effort', 
        'not good', 'poor work', 'could have done better']
labels = array([1,1,1,1,1,0,0,0,0,0])
In [114]:
sentences = [d.split(' ') for d in docs]
print(sentences)
[['well', 'done'], ['good', 'work'], ['great', 'effort'], ['nice', 'work'], ['excellent'], ['weak'], ['poor', 'effort'], ['not', 'good'], ['poor', 'work'], ['could', 'have', 'done', 'better']]
In [115]:
import warnings
warnings.filterwarnings('ignore')

from gensim.models import Word2Vec
model = Word2Vec(sentences, min_count = 1)
print(model)
words = list(model.wv.vocab)
print(words)
print(model['work'])
Word2Vec(vocab=14, size=100, alpha=0.025)
['well', 'done', 'good', 'work', 'great', 'effort', 'nice', 'excellent', 'weak', 'poor', 'not', 'could', 'have', 'better']
[-0.00282436  0.00136096  0.00049871 ... -0.00405598  0.00144478
 -0.00363551]
In [118]:
my_labels = list(np.ones(8)) + list(np.zeros(6))
In [119]:
from sklearn.decomposition import PCA
from umap import UMAP
import matplotlib.pyplot as plt
X = model[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(result[:, 0], result[:, 1], c = my_labels, s = 100, cmap = 'tab10')
words = list(model.wv.vocab)
for i, word in enumerate(words):
    #plt.annotate(word, xy=(result[i, 0], result[i, 1]), fontsize = 20)
    plt.text(result[i, 0], result[i, 1], word, fontsize = 20, color = 'red')
plt.show()
In [122]:
from umap import UMAP
import matplotlib.pyplot as plt
X = model[model.wv.vocab]
umap_model = UMAP(n_neighbors = 5, min_dist = 0.5, n_components = 2)
umap = umap_model.fit_transform(X, y = my_labels)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = my_labels, s = 100, cmap = 'tab10')
plt.title('UMAP', fontsize = 20)
plt.xlabel("UMAP1", fontsize = 20)
plt.ylabel("UMAP2", fontsize = 20)
words = list(model.wv.vocab)
for i, word in enumerate(words):
    plt.text(umap[i, 0], umap[i, 1], word, fontsize = 20, color = 'red')
plt.show()
In [33]:
from keras.preprocessing.text import Tokenizer
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# integer encode documents
encoded_docs = t.texts_to_matrix(docs, mode = 'count')
print(encoded_docs)
encoded_docs.shape
OrderedDict([('well', 1), ('done', 2), ('good', 2), ('work', 3), ('great', 1), ('effort', 2), ('nice', 1), ('excellent', 1), ('weak', 1), ('poor', 2), ('not', 1), ('could', 1), ('have', 1), ('better', 1)])
10
{'work': 1, 'done': 2, 'good': 3, 'effort': 4, 'poor': 5, 'well': 6, 'great': 7, 'nice': 8, 'excellent': 9, 'weak': 10, 'not': 11, 'could': 12, 'have': 13, 'better': 14}
defaultdict(<class 'int'>, {'done': 2, 'well': 1, 'work': 3, 'good': 2, 'effort': 2, 'great': 1, 'nice': 1, 'excellent': 1, 'weak': 1, 'poor': 2, 'not': 1, 'have': 1, 'better': 1, 'could': 1})
[[0. 0. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 1. 1. 1.]]
Out[33]:
(10, 15)
In [47]:
encoded_docs[9]
Out[47]:
array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1.])
In [125]:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding

docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
In [126]:
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
[[29, 7], [4, 4], [7, 22], [2, 4], [37], [7], [40, 22], [19, 4], [40, 4], [36, 14, 7, 1]]
In [127]:
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
[[29  7  0  0]
 [ 4  4  0  0]
 [ 7 22  0  0]
 [ 2  4  0  0]
 [37  0  0  0]
 [ 7  0  0  0]
 [40 22  0  0]
 [19  4  0  0]
 [40  4  0  0]
 [36 14  7  1]]
In [137]:
# Replace the padded integer documents with bag-of-words count vectors,
# reusing the tokenizer t fitted above
padded_docs = t.texts_to_matrix(docs, mode = 'count')
print(padded_docs)
print(padded_docs.shape)
max_length = padded_docs.shape[1]
vocab_size = padded_docs.shape[1]
[[0. 0. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 1. 1. 1.]]
(10, 15)
In [142]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_5 (Embedding)      (None, 15, 8)             120       
_________________________________________________________________
flatten_8 (Flatten)          (None, 120)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 1)                 121       
=================================================================
Total params: 241
Trainable params: 241
Non-trainable params: 0
_________________________________________________________________
None
In [143]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=2)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
Epoch 1/50
 - 0s - loss: 0.6947 - acc: 0.5000
...
Epoch 9/50
 - 0s - loss: 0.6918 - acc: 0.6000
...
Epoch 41/50
 - 0s - loss: 0.6720 - acc: 0.7000
Epoch 42/50
 - 0s - loss: 0.6710 - acc: 0.8000
...
Epoch 46/50
 - 0s - loss: 0.6669 - acc: 0.9000
...
Epoch 50/50
 - 0s - loss: 0.6622 - acc: 0.9000
Accuracy: 89.999998

Now we will do a simple sentiment analysis using the IMDB reviews. The Movie Review Data is a collection of movie reviews retrieved from the imdb.com website in the early 2000s by Bo Pang and Lillian Lee, who collected and released it as part of their research on natural language processing. The dataset comprises 1,000 positive and 1,000 negative movie reviews drawn from an archive of the rec.arts.movies.reviews newsgroup hosted at imdb.com.

In [2]:
from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D

import os
os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')

from nltk.corpus import stopwords
import string
 
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text
 
# turn a doc into clean tokens
def clean_doc(doc):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', string.punctuation)
	tokens = [w.translate(table) for w in tokens]
	# remove remaining tokens that are not alphabetic
	tokens = [word for word in tokens if word.isalpha()]
	# filter out stop words
	stop_words = set(stopwords.words('english'))
	tokens = [w for w in tokens if not w in stop_words]
	# filter out short tokens
	tokens = [word for word in tokens if len(word) > 1]
	return tokens
 
# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens[0:20])
['films', 'adapted', 'comic', 'books', 'plenty', 'success', 'whether', 'theyre', 'superheroes', 'batman', 'superman', 'spawn', 'geared', 'toward', 'kids', 'casper', 'arthouse', 'crowd', 'ghost', 'world']
In [3]:
stop_words = list(set(stopwords.words('english')))
stop_words[0:5]
Out[3]:
['more', 'theirs', 'because', 'how', 'her']
In [4]:
table = str.maketrans('', '', string.punctuation)
print(string.punctuation)
'book,#$'.translate(table)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Out[4]:
'book'
In [5]:
from string import punctuation
from os import listdir
from collections import Counter
from nltk.corpus import stopwords

# load doc and add to vocab
def add_doc_to_vocab(filename, vocab):
	# load doc
	doc = load_doc(filename)
	# clean doc
	tokens = clean_doc(doc)
	# update counts
	vocab.update(tokens)
 
# load all docs in a directory
def process_docs(directory, vocab, is_train):
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# add doc to vocab
		add_doc_to_vocab(path, vocab)
 
# define vocab
vocab = Counter()
# add all docs to vocab
process_docs('txt_sentoken/neg', vocab, True)
process_docs('txt_sentoken/pos', vocab, True)
# print the size of the vocab
print(len(vocab))
# print the top words in the vocab
print(vocab.most_common(50))
44276
[('film', 7983), ('one', 4946), ('movie', 4826), ('like', 3201), ('even', 2262), ('good', 2080), ('time', 2041), ('story', 1907), ('films', 1873), ('would', 1844), ('much', 1824), ('also', 1757), ('characters', 1735), ('get', 1724), ('character', 1703), ('two', 1643), ('first', 1588), ('see', 1557), ('way', 1515), ('well', 1511), ('make', 1418), ('really', 1407), ('little', 1351), ('life', 1334), ('plot', 1288), ('people', 1269), ('bad', 1248), ('could', 1248), ('scene', 1241), ('movies', 1238), ('never', 1201), ('best', 1179), ('new', 1140), ('scenes', 1135), ('man', 1131), ('many', 1130), ('doesnt', 1118), ('know', 1092), ('dont', 1086), ('hes', 1024), ('great', 1014), ('another', 992), ('action', 985), ('love', 977), ('us', 967), ('go', 952), ('director', 948), ('end', 946), ('something', 945), ('still', 936)]
In [6]:
# keep tokens with a min occurrence
min_occurrence = 2
tokens = [k for k,c in vocab.items() if c >= min_occurrence]
print(len(tokens))
25767
In [7]:
# save list to file
def save_list(lines, filename):
	# convert lines to a single blob of text
	data = '\n'.join(lines)
	# open file
	file = open(filename, 'w')
	# write text
	file.write(data)
	# close file
	file.close()
 
# save tokens to a vocabulary file
save_list(tokens, 'vocab.txt')
In [8]:
from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
 
# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text
 
# turn a doc into clean tokens
def clean_doc(doc, vocab):
	# split into tokens by white space
	tokens = doc.split()
	# remove punctuation from each token
	table = str.maketrans('', '', punctuation)
	tokens = [w.translate(table) for w in tokens]
	# filter out tokens not in vocab
	tokens = [w for w in tokens if w in vocab]
	tokens = ' '.join(tokens)
	return tokens
 
# load all docs in a directory
def process_docs(directory, vocab, is_train):
	documents = list()
	# walk through all files in the folder
	for filename in listdir(directory):
		# skip any reviews in the test set
		if is_train and filename.startswith('cv9'):
			continue
		if not is_train and not filename.startswith('cv9'):
			continue
		# create the full path of the file to open
		path = directory + '/' + filename
		# load the doc
		doc = load_doc(path)
		# clean doc
		tokens = clean_doc(doc, vocab)
		# add to list
		documents.append(tokens)
	return documents
 
# load the vocabulary
vocab_filename = 'vocab.txt'
vocab = load_doc(vocab_filename)
vocab = vocab.split()
vocab = set(vocab)
 
# load all training reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, True)
negative_docs = process_docs('txt_sentoken/neg', vocab, True)
train_docs = negative_docs + positive_docs
In [12]:
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(train_docs)
 
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(train_docs)
# pad sequences
max_length = max([len(s.split()) for s in train_docs])
Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define training labels
ytrain = array([0 for _ in range(900)] + [1 for _ in range(900)])
In [20]:
len(encoded_docs[0])
Out[20]:
310
In [22]:
max_length
Out[22]:
1317
In [24]:
Xtrain.shape
Out[24]:
(1800, 1317)
In [25]:
# load all test reviews
positive_docs = process_docs('txt_sentoken/pos', vocab, False)
negative_docs = process_docs('txt_sentoken/neg', vocab, False)
test_docs = negative_docs + positive_docs
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(test_docs)
# pad sequences
Xtest = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
# define test labels
ytest = array([0 for _ in range(100)] + [1 for _ in range(100)])
In [26]:
# define vocabulary size: number of known words + 1, since index 0 is reserved for padding
vocab_size = len(tokenizer.word_index) + 1
vocab_size
Out[26]:
25768
In [28]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=max_length))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 1317, 100)         2576800   
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 1310, 32)          25632     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 655, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 20960)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)                209610    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 11        
=================================================================
Total params: 2,812,053
Trainable params: 2,812,053
Non-trainable params: 0
_________________________________________________________________
None
In [30]:
# compile network
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(Xtrain, ytrain, epochs=10, verbose=1)
# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %f' % (acc*100))
Epoch 1/10
1800/1800 [==============================] - 10s 5ms/step - loss: 0.1128 - acc: 0.9728
Epoch 2/10
1800/1800 [==============================] - 10s 6ms/step - loss: 0.0124 - acc: 1.0000
Epoch 3/10
1800/1800 [==============================] - 11s 6ms/step - loss: 0.0023 - acc: 1.0000
Epoch 4/10
1800/1800 [==============================] - 10s 6ms/step - loss: 7.2066e-04 - acc: 1.0000
Epoch 5/10
1800/1800 [==============================] - 10s 6ms/step - loss: 4.6669e-04 - acc: 1.0000
Epoch 6/10
1800/1800 [==============================] - 10s 6ms/step - loss: 3.4066e-04 - acc: 1.0000
Epoch 7/10
1800/1800 [==============================] - 11s 6ms/step - loss: 2.4853e-04 - acc: 1.0000
Epoch 8/10
1800/1800 [==============================] - 10s 6ms/step - loss: 1.8236e-04 - acc: 1.0000
Epoch 9/10
1800/1800 [==============================] - 10s 6ms/step - loss: 1.3405e-04 - acc: 1.0000
Epoch 10/10
1800/1800 [==============================] - 10s 6ms/step - loss: 1.1555e-04 - acc: 1.0000
Test Accuracy: 88.000000

Use NLP and Word Embeddings

Here we will include a Word Embedding layer in the Convolutional Neural Network (CNN) in order to reduce dimensionality, increase the classification accuracy, and visualize the low-dimensional word embeddings. We start by reading the sequences of the Neanderthal introgressed and depleted regions and merging them together for later classification.

In [1]:
import os
from Bio import SeqIO

os.chdir('/home/nikolay/Documents/Medium/DeepLearningNeanderthalIntrogression/')

intr_file = 'hg19_introgr_clean.fa'
depl_file = 'hg19_depl_clean.fa'

e = 0
intr_seqs = []
depl_seqs = []
for intr, depl in zip(SeqIO.parse(intr_file, 'fasta'), SeqIO.parse(depl_file, 'fasta')):
    intr_seqs.append(str(intr.seq)[0:100])
    depl_seqs.append(str(depl.seq)[0:100])
    '''
    step = 100; jump = 1500; a = 0; b = step
    for j in range(5):
        s_intr = str(intr.seq)[a:b]
        s_depl = str(depl.seq)[a:b]
        if s_intr.count('A')>0 and s_intr.count('C')>0 and s_intr.count('G')>0 and s_intr.count('T')>0 and \
        s_depl.count('A')>0 and s_depl.count('C')>0 and s_depl.count('G')>0 and s_depl.count('T')>0:
            intr_seqs.append(s_intr)
            depl_seqs.append(s_depl)
        a = a + jump
        b = a + step
    '''
    e = e + 1
    if e%10000 == 0:
        print('Finished ' + str(e) + ' entries')
Finished 10000 entries
Finished 20000 entries
Finished 30000 entries
Finished 40000 entries
Finished 50000 entries
Finished 60000 entries
Finished 70000 entries
In [2]:
sequences = intr_seqs + depl_seqs
len(sequences)
Out[2]:
155216
In [3]:
sequences[0:10]
Out[3]:
['AATGACATTACTATGACAATTTGCTTGAGAATCTTTTGTTGCAAAGAATCTTCAGGAAACACCCGTCTGGCAGCATGGTGGGTGTGGAGGGCCTCAGCTG',
 'GGATGCCCTCAATGGCAGCAGCCCATCTCTGGGCACCATGCAGAGCAGTCCACCAGCATCTGGCTGCATGTGGATAATGCCACTGAGGTGTGGTCTTGGG',
 'ATGGCTTGAGCCCAGGAGCTCAAGACCAGCTTGAACAACATAGTGAGACCCTGTCTCAACAAAAAGTACAAACATTAGCTGGGTGTGGTGGTGCATGCCT',
 'ATGGCTTGAGCCCAGGAGCTCAAGACCAGCTTGAACAACATAGTGAGACCCTGTCTCAACAAAAAGTACAAACATTAGCTGGGTGTGGTGGTGCATGCCT',
 'ATGGCTTGAGCCCAGGAGCTCAAGACCAGCTTGAACAACATAGTGAGACCCTGTCTCAACAAAAAGTACAAACATTAGCTGGGTGTGGTGGTGCATGCCT',
 'ATGGCTTGAGCCCAGGAGCTCAAGACCAGCTTGAACAACATAGTGAGACCCTGTCTCAACAAAAAGTACAAACATTAGCTGGGTGTGGTGGTGCATGCCT',
 'ATGGCTTGAGCCCAGGAGCTCAAGACCAGCTTGAACAACATAGTGAGACCCTGTCTCAACAAAAAGTACAAACATTAGCTGGGTGTGGTGGTGCATGCCT',
 'ATGGCTTGAGCCCAGGAGCTCAAGACCAGCTTGAACAACATAGTGAGACCCTGTCTCAACAAAAAGTACAAACATTAGCTGGGTGTGGTGGTGCATGCCT',
 'ATGGCTTGAGCCCAGGAGCTCAAGACCAGCTTGAACAACATAGTGAGACCCTGTCTCAACAAAAAGTACAAACATTAGCTGGGTGTGGTGGTGCATGCCT',
 'ATGGCTTGAGCCCAGGAGCTCAAGACCAGCTTGAACAACATAGTGAGACCCTGTCTCAACAAAAAGTACAAACATTAGCTGGGTGTGGTGGTGCATGCCT']

Suppose the sequences in the list above are different texts. It is then natural to treat k-mers as the words of those texts. Different sequences can share the same k-mers, just as the same words appear in different texts; however, some k-mers are specific to particular sequences, and their counts can say something about the topic of the text, i.e. the biology of the sequence. Here we are going to split each sequence into k-mers and construct sentences, i.e. lists of words / k-mers.

In [4]:
def getKmers(sequence, size):
    return [sequence[x:x+size].upper() for x in range(len(sequence) - size + 1)]
In [5]:
print('Building Neanderthal introgressed sequences')
intr_sentences = []
for i in range(len(intr_seqs)):
    intr_sentences.append(getKmers(intr_seqs[i], 5))

print('Building Neanderthal depleted sequences')
depl_sentences = []
for i in range(len(depl_seqs)):
    depl_sentences.append(getKmers(depl_seqs[i], 5))

print('Building merged Neanderthal introgressed and depleted sequences')
sentences = []
for i in range(len(sequences)):
    sentences.append(getKmers(sequences[i], 5))
Building Neanderthal introgressed sequences
Building Neanderthal depleted sequences
Building merged Neanderthal introgressed and depleted sequences
In [6]:
sentences[0][0:5]
Out[6]:
['AATGA', 'ATGAC', 'TGACA', 'GACAT', 'ACATT']

The words / k-mers form a vocabulary whose size we will determine later. We can also use the Counter class to count the words efficiently, and to display and plot the most common words for the merged Neanderthal introgressed and depleted regions.

In [7]:
from collections import Counter
Counter([item for sublist in sentences for item in sublist]).most_common(10)
Out[7]:
[('TTTTT', 98546),
 ('AAAAA', 87867),
 ('ATTTT', 61714),
 ('AAAAT', 59269),
 ('TTTTA', 51861),
 ('TATTT', 51797),
 ('AAATA', 50218),
 ('TAAAA', 49223),
 ('TTTCT', 49150),
 ('AGAAA', 48903)]
In [9]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20,15))
D = dict(Counter([item for sublist in sentences for item in sublist]).most_common(20))
plt.bar(range(len(D)), list(D.values()), align='center')
plt.title('Most Common K-mers', fontsize = 20)
plt.ylabel("Counts", fontsize = 20)
plt.xticks(rotation = 90)
plt.xticks(range(len(D)), list(D.keys()), fontsize = 20)
plt.show()

Now let us display the k-mer / word / token counts for the Neanderthal introgressed regions:

In [10]:
import pandas as pd
intr_counts = dict(Counter([item for sublist in intr_sentences for item in sublist]))
kmers = list(intr_counts.keys())
counts = list(intr_counts.values())
intr_df = pd.DataFrame({'Kmer': kmers, 'Count': counts})
intr_df = intr_df.sort_values(['Count'], ascending = False)
intr_df.head(10)
Out[10]:
Kmer Count
818 TTTTT 48217
206 AAAAA 37705
861 ATTTT 30912
649 AAAAT 28796
866 TTTTA 26415
901 TATTT 26271
530 TTTCT 25166
351 AGAAA 24998
650 AAATA 24659
806 TAAAA 24461

We can also check the GC content in the Neanderthal introgressed regions:

In [11]:
intr_long_seq = ''.join(intr_seqs)
((intr_long_seq.count('C') + intr_long_seq.count('G')) / len(intr_long_seq))*100
Out[11]:
40.299788681579216

The GC-content of the chopped Neanderthal introgression sequences (taking only the first 100 bp of each region) is slightly higher than what we obtain for the full regions without chopping:

In [32]:
%%bash
cat hg19_introgr_clean.fa | grep -v ">" | awk 'BEGIN{a=0; c=0; g=0; t=0;} {a+=gsub("A",""); 
c+=gsub("C",""); g+=gsub("G",""); t+=gsub("T","");} END{print a,c,g,t}'
2131712452 1415460271 1416141395 2138472589
In [34]:
((1415460271 + 1416141395) / (2131712452 + 1415460271 + 1416141395 + 2138472589))*100
Out[34]:
39.87167994230216

Now, for comparison, we will check the k-mer / word / token counts for the Neanderthal depleted sequences and compute the GC-content for the chopped and full regions:

In [12]:
import pandas as pd
depl_counts = dict(Counter([item for sublist in depl_sentences for item in sublist]))
kmers = list(depl_counts.keys())
counts = list(depl_counts.values())
depl_df = pd.DataFrame({'Kmer': kmers, 'Count': counts})
depl_df = depl_df.sort_values(['Count'], ascending = False)
depl_df.head(10)
Out[12]:
Kmer Count
148 TTTTT 50329
248 AAAAA 50162
211 ATTTT 30802
258 AAAAT 30473
223 AAATA 25559
7 TATTT 25526
197 TTTTA 25446
247 TAAAA 24762
0 TTTCT 23984
63 AGAAA 23905
In [13]:
depl_long_seq = ''.join(depl_seqs)
((depl_long_seq.count('C') + depl_long_seq.count('G')) / len(depl_long_seq))*100
Out[13]:
40.73978198123905
In [33]:
%%bash
cat hg19_depl_clean.fa | grep -v ">" | awk 'BEGIN{a=0; c=0; g=0; t=0;} {a+=gsub("A",""); 
c+=gsub("C",""); g+=gsub("G",""); t+=gsub("T","");} END{print a,c,g,t}'
2106254094 1442323861 1442967008 2110241744
In [35]:
((1442323861 + 1442967008) / (2106254094 + 1442323861 + 1442967008 + 2110241744))*100
Out[35]:
40.62767565457947
In [14]:
merge_df = pd.merge(intr_df, depl_df, on = 'Kmer')
merge_df['Odds'] = merge_df['Count_y'] / merge_df['Count_x']
sorted_merge_df = merge_df.sort_values(['Odds'], ascending = False)
sorted_merge_df.head()
Out[14]:
Kmer Count_x Count_y Odds
1017 CGCGC 313 638 2.038339
955 GCGCC 785 1591 2.026752
829 CCGCC 1254 2465 1.965710
868 CGCCC 1075 2023 1.881860
905 GGCGT 962 1769 1.838877
In [15]:
sorted_merge_df.tail()
Out[15]:
Kmer Count_x Count_y Odds
880 TTCGG 1039 887 0.853705
883 CCGTT 1033 880 0.851888
997 GTACG 613 520 0.848287
958 CCGTA 767 639 0.833116
933 TACGG 859 708 0.824214

What we can see is that the GC-content differs between the Neanderthal depleted and introgressed regions: by about 0.4 percentage points for the chopped sequences and by almost 0.8 percentage points for the full regions. The depleted regions have the higher GC-content, suggesting that they more likely overlap gene regions.
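The GC-content figures for the full regions can be checked directly from the per-base counts printed by the awk commands above (a quick arithmetic sketch):

```python
# GC content (percent) from base counts a, c, g, t
def gc_percent(a, c, g, t):
    return 100 * (c + g) / (a + c + g + t)

# base counts from the awk output for the full (unchopped) regions
intr_full = gc_percent(2131712452, 1415460271, 1416141395, 2138472589)
depl_full = gc_percent(2106254094, 1442323861, 1442967008, 2110241744)
print(round(intr_full, 2), round(depl_full, 2))  # 39.87 40.63
print(round(depl_full - intr_full, 2))           # 0.76 percentage points
```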

We can now think of the k-mers originating from the same sequence as forming a sentence, text, or document (doc). For example, let us display one such sentence: we can clearly see the words / k-mers separated by spaces, just like in a regular text.

In [16]:
' '.join(intr_sentences[0])
Out[16]:
'AATGA ATGAC TGACA GACAT ACATT CATTA ATTAC TTACT TACTA ACTAT CTATG TATGA ATGAC TGACA GACAA ACAAT CAATT AATTT ATTTG TTTGC TTGCT TGCTT GCTTG CTTGA TTGAG TGAGA GAGAA AGAAT GAATC AATCT ATCTT TCTTT CTTTT TTTTG TTTGT TTGTT TGTTG GTTGC TTGCA TGCAA GCAAA CAAAG AAAGA AAGAA AGAAT GAATC AATCT ATCTT TCTTC CTTCA TTCAG TCAGG CAGGA AGGAA GGAAA GAAAC AAACA AACAC ACACC CACCC ACCCG CCCGT CCGTC CGTCT GTCTG TCTGG CTGGC TGGCA GGCAG GCAGC CAGCA AGCAT GCATG CATGG ATGGT TGGTG GGTGG GTGGG TGGGT GGGTG GGTGT GTGTG TGTGG GTGGA TGGAG GGAGG GAGGG AGGGC GGGCC GGCCT GCCTC CCTCA CTCAG TCAGC CAGCT AGCTG'

Now we are going to convert the sentences into word vectors, i.e. quantify the words / k-mers based on their similarity to each other. The number of unique k-mers gives us the vocabulary / dictionary of our molecular / biological texts.

In [17]:
import warnings
warnings.filterwarnings('ignore')

from gensim.models import Word2Vec
model = Word2Vec(sentences, min_count = 2, workers = 4)
print(model)
Word2Vec(vocab=1024, size=100, alpha=0.025)
In [18]:
X = model[model.wv.vocab]
X.shape
Out[18]:
(1024, 100)
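The vocabulary size of 1024 is no accident: with a four-letter nucleotide alphabet there are exactly 4^5 = 1024 possible 5-mers, so with min_count = 2 every possible 5-mer must occur at least twice in the corpus. A quick check of this count:

```python
from itertools import product

# enumerate all possible DNA 5-mers: 4^5 = 1024,
# matching the Word2Vec vocabulary size above
all_kmers = [''.join(p) for p in product('ACGT', repeat=5)]
print(len(all_kmers))  # 1024
```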

Now each word is one observation with 100 coordinates, i.e. the default number of latent dimensions for word2vec. Next we can use the constructed word vectors to visualize the k-mer space with PCA and UMAP.

In [19]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
X = model[model.wv.vocab]
pca = PCA(n_components = 2)
result = pca.fit_transform(X)

plt.figure(figsize=(20,18))
plt.scatter(result[:, 0], result[:, 1], s = 10, cmap = 'tab10')
plt.title('Principal Components Analysis (PCA) of K-mers', fontsize = 20)
plt.xlabel("PC1", fontsize = 20)
plt.ylabel("PC2", fontsize = 20)
words = list(model.wv.vocab)
for i, word in enumerate(words):
    if word == 'CCGCG':
        plt.text(result[i, 0], result[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'CGCGG':
        plt.text(result[i, 0], result[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TTATA':
        plt.text(result[i, 0], result[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TATAA':
        plt.text(result[i, 0], result[i, 1], word, fontsize = 30, c = 'green')
    else:
        plt.text(result[i, 0], result[i, 1], word, fontsize = 10, c = 'red')
plt.show()

From the PCA plot it is clear that one cluster of k-mers is enriched in GC-nucleotides, while the other, bigger cluster is predominantly AT-rich.

In [20]:
from umap import UMAP
import matplotlib.pyplot as plt
X = model[model.wv.vocab]
X_reduced = PCA(n_components = 5).fit_transform(X)
umap_model = UMAP(n_neighbors = 30, min_dist = 0.1, n_components = 2)
umap = umap_model.fit_transform(X_reduced)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], s = 10, cmap = 'tab10')
plt.title('UMAP', fontsize = 20)
plt.xlabel("UMAP1", fontsize = 20)
plt.ylabel("UMAP2", fontsize = 20)
words = list(model.wv.vocab)
for i, word in enumerate(words):
    if word == 'CCGCG':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'CGCGG':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TATAA':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TTATA':
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 30, c = 'green')
    else:
        plt.text(umap[i, 0], umap[i, 1], word, fontsize = 10, c = 'red')
plt.show()
In [21]:
from sklearn.manifold import TSNE

plt.figure(figsize=(20, 15))
X_reduced = PCA(n_components = 5).fit_transform(X)
tsne_model = TSNE(learning_rate = 500, n_components = 2, random_state = 123, perplexity = 30)
tsne = tsne_model.fit_transform(X_reduced)
plt.scatter(tsne[:, 0], tsne[:, 1], cmap = 'tab10', s = 10)
plt.title('tSNE on PCA', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20)
plt.ylabel("tSNE2", fontsize = 20)
words = list(model.wv.vocab)
for i, word in enumerate(words):
    if word == 'CCGCG':
        plt.text(tsne[i, 0], tsne[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'CGCGG':
        plt.text(tsne[i, 0], tsne[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TATAA':
        plt.text(tsne[i, 0], tsne[i, 1], word, fontsize = 30, c = 'green')
    elif word == 'TTATA':
        plt.text(tsne[i, 0], tsne[i, 1], word, fontsize = 30, c = 'green')
    else:
        plt.text(tsne[i, 0], tsne[i, 1], word, fontsize = 10, c = 'red')
plt.show()

From both the UMAP and tSNE plots we again clearly see that one cluster of k-mers is enriched in GC-nucleotides, while the other, bigger cluster is predominantly AT-rich.

Now let us define an important variable, max_length, which will be used in the keras Embedding layer; it indicates how many k-mers we have in each text / sequence. Next we will concatenate the k-mers from each sequence into a "sentence"; these sentences will build the documents (docs):

In [22]:
max_length = len(sentences[0])
print(max_length)
96
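The 96 follows directly from the sliding window: a 100 bp chunk split into overlapping 5-mers yields len - k + 1 windows. A quick check of this arithmetic:

```python
# sliding window of overlapping k-mers, as in getKmers above
def get_kmers(sequence, size=5):
    return [sequence[i:i + size].upper() for i in range(len(sequence) - size + 1)]

chunk = 'ACGT' * 25           # a 100 bp sequence
print(len(get_kmers(chunk)))  # 100 - 5 + 1 = 96
```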
In [23]:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding

docs = [' '.join(i) for i in sentences]
print(docs[0],'\n')
print('We have ' + str(len(docs)) + ' texts / sequences')
print('Each text / sequence has ' + str(len(docs[0].split(' '))) + ' words / k-mers')
Using TensorFlow backend.
AATGA ATGAC TGACA GACAT ACATT CATTA ATTAC TTACT TACTA ACTAT CTATG TATGA ATGAC TGACA GACAA ACAAT CAATT AATTT ATTTG TTTGC TTGCT TGCTT GCTTG CTTGA TTGAG TGAGA GAGAA AGAAT GAATC AATCT ATCTT TCTTT CTTTT TTTTG TTTGT TTGTT TGTTG GTTGC TTGCA TGCAA GCAAA CAAAG AAAGA AAGAA AGAAT GAATC AATCT ATCTT TCTTC CTTCA TTCAG TCAGG CAGGA AGGAA GGAAA GAAAC AAACA AACAC ACACC CACCC ACCCG CCCGT CCGTC CGTCT GTCTG TCTGG CTGGC TGGCA GGCAG GCAGC CAGCA AGCAT GCATG CATGG ATGGT TGGTG GGTGG GTGGG TGGGT GGGTG GGTGT GTGTG TGTGG GTGGA TGGAG GGAGG GAGGG AGGGC GGGCC GGCCT GCCTC CCTCA CTCAG TCAGC CAGCT AGCTG 

We have 155216 texts / sequences
Each text / sequence has 96 words / k-mers
In [24]:
from string import punctuation
from os import listdir
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
# create the tokenizer
tokenizer = Tokenizer()
# fit the tokenizer on the documents
tokenizer.fit_on_texts(docs)
In [25]:
# sequence encode
encoded_docs = tokenizer.texts_to_sequences(docs)
# pad sequences
max_length = max([len(s.split()) for s in docs])
print(max_length)
#Xtrain = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
96
In [26]:
print(encoded_docs[0])
print(len(encoded_docs[0]))
print(len(encoded_docs))
[66, 628, 353, 556, 90, 308, 467, 304, 615, 513, 620, 371, 628, 353, 527, 235, 258, 14, 69, 210, 149, 133, 639, 313, 216, 93, 75, 76, 652, 202, 144, 21, 24, 25, 36, 49, 250, 722, 288, 271, 227, 116, 26, 23, 76, 652, 202, 144, 120, 169, 134, 275, 124, 50, 70, 322, 37, 432, 659, 576, 859, 844, 872, 815, 585, 255, 404, 309, 251, 584, 139, 364, 540, 361, 400, 242, 301, 433, 409, 579, 650, 164, 249, 523, 150, 131, 468, 631, 730, 580, 342, 200, 126, 379, 212, 217]
96
155216
In [27]:
import numpy as np
my_labels = list(np.ones(len(intr_seqs))) + list(np.zeros(len(depl_seqs)))
len(my_labels)
Out[27]:
155216
In [28]:
# integer encode the documents
#vocab_size = 1500
#encoded_docs = [one_hot(d, vocab_size) for d in docs]
#print(encoded_docs[0:10])
In [29]:
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
[[ 66 628 353 ... 379 212 217]
 [568 734 581 ... 204 372 358]
 [596 167 539 ... 557 581 137]
 ...
 [174 338 138 ...  13   9  59]
 [ 32   7  22 ... 485 307 280]
 [404 309 251 ... 246 389 297]]
In [30]:
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)
1025
In [31]:
from sklearn.model_selection import train_test_split

train_features, test_features, train_labels, test_labels = train_test_split(
    padded_docs, my_labels, test_size = 0.2, random_state = 42)
In [32]:
train_features.shape
Out[32]:
(124172, 96)
In [33]:
len(train_labels)
Out[33]:
124172
In [34]:
test_features.shape
Out[34]:
(31044, 96)
In [35]:
len(test_labels)
Out[35]:
31044
In [54]:
from keras.optimizers import SGD, Adam, Adadelta
from keras.layers import Conv1D, Dense, MaxPooling1D, Flatten, Dropout, Embedding, Activation
from keras.models import Sequential
from keras.regularizers import l1

import warnings
warnings.filterwarnings('ignore')

model = Sequential()
model.add(Embedding(vocab_size, 100, input_length = max_length))
model.add(Conv1D(filters = 128, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Conv1D(filters = 128, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size = 2))
model.add(Dropout(0.5))
#model.add(Conv1D(filters = 128, kernel_size = 5, padding = 'same', kernel_initializer = 'he_uniform'))
#model.add(Activation("relu"))
#model.add(MaxPooling1D(pool_size = 2))
#model.add(Dropout(0.4))

model.add(Flatten())
model.add(Dense(16, kernel_initializer = 'he_uniform'))
model.add(Activation("relu"))
model.add(Dropout(0.5))
model.add(Dense(1, activation = 'sigmoid'))

epochs = 100
lrate = 0.01
decay = lrate / epochs
sgd = SGD(lr = lrate, momentum = 0.9, decay = decay, nesterov = False)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['binary_accuracy'])
#model.compile(loss='binary_crossentropy', optimizer=Adam(lr = lrate), metrics=['binary_accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_10 (Embedding)     (None, 96, 100)           102500    
_________________________________________________________________
conv1d_11 (Conv1D)           (None, 96, 128)           64128     
_________________________________________________________________
activation_20 (Activation)   (None, 96, 128)           0         
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 96, 128)           82048     
_________________________________________________________________
activation_21 (Activation)   (None, 96, 128)           0         
_________________________________________________________________
max_pooling1d_11 (MaxPooling (None, 48, 128)           0         
_________________________________________________________________
dropout_20 (Dropout)         (None, 48, 128)           0         
_________________________________________________________________
flatten_10 (Flatten)         (None, 6144)              0         
_________________________________________________________________
dense_19 (Dense)             (None, 16)                98320     
_________________________________________________________________
activation_22 (Activation)   (None, 16)                0         
_________________________________________________________________
dropout_21 (Dropout)         (None, 16)                0         
_________________________________________________________________
dense_20 (Dense)             (None, 1)                 17        
=================================================================
Total params: 347,013
Trainable params: 347,013
Non-trainable params: 0
_________________________________________________________________
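The `Param #` column can be reproduced by hand as a sanity check (a sketch; `vocab_size = 1025` is inferred from the Embedding layer's 102,500 parameters at embedding dimension 100):

```python
# Per-layer parameter counts, matching model.summary() above
embedding = 1025 * 100            # vocab_size * embedding_dim            = 102500
conv1     = 5 * 100 * 128 + 128   # kernel * in_channels * filters + bias = 64128
conv2     = 5 * 128 * 128 + 128   #                                       = 82048
dense1    = 48 * 128 * 16 + 16    # Flatten yields 48 * 128 = 6144 inputs = 98320
dense2    = 16 * 1 + 1            #                                       = 17
print(embedding + conv1 + conv2 + dense1 + dense2)  # 347013
```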
In [55]:
import warnings
warnings.filterwarnings('ignore')

history = model.fit(train_features, train_labels, 
                    epochs = epochs, verbose = 1, validation_split = 0.2, batch_size = 128, shuffle = True)
Train on 99337 samples, validate on 24835 samples
Epoch 1/100
99337/99337 [==============================] - 115s 1ms/step - loss: 0.6933 - binary_accuracy: 0.5008 - val_loss: 0.6931 - val_binary_accuracy: 0.4965
Epoch 2/100
99337/99337 [==============================] - 120s 1ms/step - loss: 0.6932 - binary_accuracy: 0.5009 - val_loss: 0.6932 - val_binary_accuracy: 0.4964
[ epochs 3-98 omitted for brevity: training loss falls steadily to ~0.14 and training accuracy climbs to ~95%, while validation loss bottoms out at 0.5943 in epoch 31 and rises thereafter, with validation accuracy plateauing around 72% ]
Epoch 99/100
99337/99337 [==============================] - 142s 1ms/step - loss: 0.1423 - binary_accuracy: 0.9512 - val_loss: 0.8776 - val_binary_accuracy: 0.7115
Epoch 100/100
99337/99337 [==============================] - 143s 1ms/step - loss: 0.1401 - binary_accuracy: 0.9506 - val_loss: 0.8790 - val_binary_accuracy: 0.7236
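The validation loss in the log above reaches its minimum of 0.5943 at epoch 31 and then climbs while the training loss keeps falling, i.e. the network starts to overfit. A minimal sketch of the patience rule behind Keras's `EarlyStopping` callback, applied to the val_loss values recorded for the first 40 epochs:

```python
# val_loss for epochs 1-40, transcribed from the training log above
val_loss = [0.6931, 0.6932, 0.6931, 0.6932, 0.6930, 0.6929, 0.6927, 0.6922,
            0.6918, 0.6911, 0.6900, 0.6877, 0.6856, 0.6801, 0.6779, 0.6673,
            0.6679, 0.6545, 0.6477, 0.6406, 0.6336, 0.6258, 0.6226, 0.6131,
            0.6100, 0.6048, 0.5995, 0.6216, 0.6034, 0.6000, 0.5943, 0.6133,
            0.5944, 0.5966, 0.6016, 0.6019, 0.6018, 0.6219, 0.6040, 0.5989]

def early_stop(history, patience):
    """Return (stop_epoch, best_epoch): halt after `patience` epochs without improvement."""
    best, best_epoch, wait = float('inf'), 0, 0
    for epoch, v in enumerate(history, start=1):
        if v < best:
            best, best_epoch, wait = v, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(history), best_epoch

print(early_stop(val_loss, patience=5))  # (36, 31): stop at epoch 36, keep epoch 31
```

With `patience = 5` training would have halted at epoch 36, keeping the epoch-31 weights and saving roughly two thirds of the compute spent here.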
In [56]:
import matplotlib.pyplot as plt

plt.figure(figsize=(20,15))
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss', fontsize = 20)
plt.ylabel('Loss', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()

plt.figure(figsize=(20,15))
plt.plot(history.history['binary_accuracy'])
plt.plot(history.history['val_binary_accuracy'])
plt.title('Model Accuracy', fontsize = 20)
plt.ylabel('Accuracy', fontsize = 20)
plt.xlabel('Epoch', fontsize = 20)
plt.legend(['Train', 'Validation'], fontsize = 20)
plt.show()
In [57]:
scores = model.evaluate(test_features, test_labels, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))
Accuracy: 72.51%
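The 72.5% test accuracy sits close to the final validation accuracy and well below the ~95% training accuracy, consistent with the overfitting seen in the training log. For reference, `binary_accuracy` simply thresholds the sigmoid output at 0.5 and takes the fraction of matches; a numpy sketch with hypothetical probabilities:

```python
import numpy as np

# What Keras's binary_accuracy computes: threshold sigmoid outputs at 0.5,
# then take the fraction of predictions that match the labels.
probs  = np.array([0.9, 0.4, 0.7, 0.2, 0.6])   # hypothetical model outputs
labels = np.array([1,   0,   0,   0,   1  ])
preds = (probs > 0.5).astype(int)              # [1, 0, 1, 0, 1]
accuracy = (preds == labels).mean()
print(accuracy)  # 0.8
```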
In [120]:
import keras.backend as K

def get_layer_output_grad(model, inputs, outputs, layer=-1):
    """ Gets gradient a layer output for given inputs and outputs"""
    grads = model.optimizer.get_gradients(model.total_loss, model.layers[layer].output)
    symb_inputs = (model._feed_inputs + model._feed_targets + model._feed_sample_weights)
    f = K.function(symb_inputs, grads)
    x, y, sample_weight = model._standardize_user_data(inputs, outputs)
    output_grad = f(x + y + sample_weight)
    return output_grad
In [230]:
import numpy as np

# Dummy batch matching the model's input/output shapes
dummy_in = np.ones((4, train_features.shape[1]))
dummy_out = np.ones((4, 1))

# Run one training step so that the optimizer's gradient tensors are defined
dummy_loss = model.train_on_batch(dummy_in, dummy_out)

output_grad = get_layer_output_grad(model, dummy_in, dummy_out)
print(dummy_in)
print(dummy_out)
print(output_grad)
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
  1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
[[1.]
 [1.]
 [1.]
 [1.]]
[array([[-0.48958904],
       [-0.48958904],
       [-0.48958904],
       [-0.48958904]], dtype=float32)]
In [231]:
dummy_in.shape
Out[231]:
(4, 96)
In [232]:
dummy_out.shape
Out[232]:
(4, 1)
In [233]:
np.array(output_grad).shape
Out[233]:
(1, 4, 1)
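The gradient printed above (~ -0.49 for each of the four identical samples) is the derivative of the mean binary cross-entropy w.r.t. each sigmoid output, which for labels y = 1 is -1 / (N * p). A numpy sketch checking that formula numerically with a hypothetical p = 0.5 (the actual post-update output of the model is not shown above):

```python
import numpy as np

# Mean binary cross-entropy over a batch of N identical samples with label 1
N, p = 4, 0.5
probs, labels = np.full(N, p), np.ones(N)

def mean_bce(pr):
    return -np.mean(labels * np.log(pr) + (1 - labels) * np.log(1 - pr))

analytic = -1.0 / (N * p)                                # -0.5

# Finite-difference check: perturb a single output
eps = 1e-6
probs_eps = probs.copy()
probs_eps[0] += eps
numeric = (mean_bce(probs_eps) - mean_bce(probs)) / eps
print(analytic, round(numeric, 4))  # -0.5 -0.5
```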